Planeación estratégica basada en analítica prescriptiva
Grupo 503
Profesor Rodolfo Miguel Gameros
Equipo 7:
A00833113 - Avril Lobato
A01771127 - Lesly Darian Romero Vázquez
A00831105 - Jazmín del Carmen Cortez Mendoza
A01284611 - Lisset Hernández
El Análisis exploratorio de datos espacial involucra diferentes herramientas para visualizar los patrones espaciales existentes en la base de datos. Estos datos se pueden segregar o descomponer los patrones espaciales en diferentes elementos a lo largo des mapa, y a su vez medir la extensión de estas autocorrelaciones espaciales. El principal paquete de R que se utiliza para este propósito es ‘geostan’.
A nivel estatal
## state year state_id crime_rate unemployment employment
## 1 Aguascalientes 2021 1057 6.75 0.04 0.97
## 2 Baja California 2021 2304 84.67 0.01 0.98
## 3 Baja California Sur 2021 2327 8.52 0.03 0.97
## 4 Campeche 2021 1086 9.22 0.02 0.98
## 5 Chiapas 2021 1182 10.01 0.05 0.97
## 6 Chihuahua 2021 888 71.98 0.04 0.97
## business_activity real_wage real_ave_month_income pop_density lq_primary
## 1 -1.90 361.02 5641.67 261.21 0.16
## 2 2.47 388.22 7599.62 53.19 0.47
## 3 -2.12 345.57 8660.90 11.27 0.73
## 4 -2.44 414.48 5357.29 16.42 0.73
## 5 -2.41 312.37 6581.28 77.16 1.56
## 6 -1.25 362.93 5708.75 15.30 0.64
## lq_secondary lq_tertiary new_fdi_real_mxn log_new_fdi_real_mxn region_a
## 1 1.24 1.00 1270.03 3.10 Bajio
## 2 1.62 0.86 20407.81 4.31 Norte
## 3 0.51 1.13 19022.55 4.28 Occidente
## 4 0.78 1.06 2390.64 3.38 Sur
## 5 0.88 0.99 189.58 2.28 Sur
## 6 1.97 0.80 7153.63 3.86 Norte
## region_b
## 1 2
## 2 3
## 3 4
## 4 5
## 5 5
## 6 3
## Rows: 32
## Columns: 17
## $ state <chr> "Aguascalientes", "Baja California", "Baja Calif…
## $ year <int> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, …
## $ state_id <int> 1057, 2304, 2327, 1086, 1182, 888, 1114, 933, 11…
## $ crime_rate <dbl> 6.75, 84.67, 8.52, 9.22, 10.01, 71.98, 11.58, 4.…
## $ unemployment <dbl> 0.04, 0.01, 0.03, 0.02, 0.05, 0.04, 0.06, 0.04, …
## $ employment <dbl> 0.97, 0.98, 0.97, 0.98, 0.97, 0.97, 0.94, 0.94, …
## $ business_activity <dbl> -1.90, 2.47, -2.12, -2.44, -2.41, -1.25, -2.08, …
## $ real_wage <dbl> 361.02, 388.22, 345.57, 414.48, 312.37, 362.93, …
## $ real_ave_month_income <dbl> 5641.67, 7599.62, 8660.90, 5357.29, 6581.28, 570…
## $ pop_density <dbl> 261.21, 53.19, 11.27, 16.42, 77.16, 15.30, 6195.…
## $ lq_primary <dbl> 0.16, 0.47, 0.73, 0.73, 1.56, 0.64, 0.03, 0.17, …
## $ lq_secondary <dbl> 1.24, 1.62, 0.51, 0.78, 0.88, 1.97, 0.58, 1.55, …
## $ lq_tertiary <dbl> 1.00, 0.86, 1.13, 1.06, 0.99, 0.80, 1.14, 0.93, …
## $ new_fdi_real_mxn <dbl> 1270.03, 20407.81, 19022.55, 2390.64, 189.58, 71…
## $ log_new_fdi_real_mxn <dbl> 3.10, 4.31, 4.28, 3.38, 2.28, 3.86, 4.54, 3.80, …
## $ region_a <chr> "Bajio", "Norte", "Occidente", "Sur", "Sur", "No…
## $ region_b <int> 2, 3, 4, 5, 5, 3, 1, 3, 4, 4, 2, 5, 1, 4, 1, 4, …
Seguidamente, se leen los archivos de las formas geométricas (geocercas) que determinarán los mapas a realizar después para visualizar los datos.
A nivel estado (32)
## Reading layer `mexlatlong' from data source
## `C:\Users\AVRIL\Downloads\mexlatlong.shp' using driver `ESRI Shapefile'
## Simple feature collection with 32 features and 19 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -118.4042 ymin: 14.55055 xmax: -86.73862 ymax: 32.71846
## CRS: NA
### Basic Map Making
state_geodata <- geo_join(mx_state_map,mx_state,'OBJECTID','state_id',how='inner')
plot(state_geodata)
tm_shape(mx_state_map) +
tm_polygons(col = "black") +
tm_compass(position = c("left", "bottom")) +
tm_text("ADMIN_NAME", size = "AREA") +
tm_layout(title = "Estados de México")
Ahora sí, mezclaremos la data de las geocercas con alguna de las variables que vienen en la tabla de datos inicial para su visualización en los mapas:
businessactivity <- tm_shape(state_geodata) +
tm_polygons(
col = "business_activity",
palette = "-RdBu", # el signo "-" invierte los colores para que los negativos estén en azul
style = "cont",
title = "Business Activity"
) +
tm_layout(
main.title = "Business Activity",
title.position = c("right", "top"),
legend.position = c("left", "bottom"),
title.size = 1
)
unemployment <- tm_shape(state_geodata) +
tm_polygons(
col = "unemployment",
palette = "viridis", # sin el "-" para que los valores más bajos sean claros
style = "quantile",
n = 8,
title = "Unemployment"
) +
tm_layout(
main.title = "Unemployment",
title.position = c("right", "top"),
legend.position = c("left", "bottom"),
title.size = 1
)
## Neighbour list object:
## Number of regions: 32
## Number of nonzero links: 138
## Percentage nonzero weights: 13.47656
## Average number of links: 4.3125
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9
## 1 6 6 6 5 2 3 2 1
## 1 least connected region:
## 31 with 1 link
## 1 most connected region:
## 8 with 9 links
sswm <- nb2listw(swm, style="W", zero.policy = TRUE)
mx_state_map_a <- as(mx_state_map, "Spatial")
mx_state_map_centroid <- coordinates(mx_state_map_a)
plot(mx_state_map_a,border="blue",axes=FALSE,las=1, main="Mexico's States Queen SWM")
plot(mx_state_map_a,col="grey",border=grey(0.9),axes=T,add=T)
plot(sswm,coords=mx_state_map_centroid,pch=19,cex=0.1,col="red",add=T)
moran.test(state_geodata$unemployment, sswm) # Global Moran's I is 0.021 but not statistically significant (p-value > 10%).
##
## Moran I test under randomisation
##
## data: state_geodata$unemployment
## weights: sswm
##
## Moran I statistic standard deviate = 0.43866, p-value = 0.3305
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.02162698 -0.03225806 0.01508982
moran.test(state_geodata$business_activity, sswm) # Global Moran's I is 0.18 but statistically significant (p-value > 10%).
##
## Moran I test under randomisation
##
## data: state_geodata$business_activity
## weights: sswm
##
## Moran I statistic standard deviate = 2.6024, p-value = 0.004629
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.184181268 -0.032258065 0.006917295
Presentación de resultados en formato tabla
table <- data.frame(Variable = c("Unemployment", "Business Activity"), GM = c(0.021, 0.18), Significance = c("NS","***"))
table
## Variable GM Significance
## 1 Unemployment 0.021 NS
## 2 Business Activity 0.180 ***
Cálculo de lags de acuerdo con la matriz de conectividad antes creada
state_geodata$sp_lag_unemployment <- lag.listw(sswm, state_geodata$unemployment, zero.policy=TRUE)
state_geodata$sp_lag_business_activity <- lag.listw(sswm, state_geodata$business_activity, zero.policy=TRUE)
unemployment_lag <- tm_shape(state_geodata) +
tm_polygons(col = "sp_lag_unemployment",
palette="Greens", style="quantile",
n=8,
title="Unemployment") +
tm_layout(main.title= 'Clusters of Unemployment',
title.position = c('right', 'top'),
legend.position= c("left", "bottom"),
title.size = 1)
ba_lag <- tm_shape(state_geodata) +
tm_polygons(
col = "sp_lag_business_activity",
palette = "OrRd",
style = "quantile",
n = 8,
midpoint = NA, # <-- evita forzar el punto medio en 0
title = "Business Activity (Lag-1)"
) +
tm_layout(
main.title = "Clusters of Business Activity (Lag-1)",
title.position = c("right", "top"),
legend.position = c("left", "bottom"),
title.size = 1
)
tmap_arrange(businessactivity, ba_lag, unemployment, unemployment_lag, ncol = 2)
# Create a regression model
M1 <- lm(sp_lag_business_activity ~ business_activity, state_geodata)
# Plot the data
plot(sp_lag_business_activity ~ business_activity, state_geodata, pch=21, asp=1, las=1, col = "grey40", bg="grey80", main="Business Activity")
abline(M1, col="blue") # Add the regression line from model M
abline(v = mean(state_geodata$business_activity), lty=3, col = "grey80")
abline(h = mean(state_geodata$business_activity), lty=3, col = "grey80")
# Create a regression model
M2 <- lm(sp_lag_unemployment ~ unemployment, state_geodata)
# Plot the data
plot(sp_lag_unemployment ~ unemployment, state_geodata, pch=21, asp=1, las=1, col = "grey40", bg="grey80", main="Unemployment")
abline(M2, col="blue") # Add the regression line from model M
abline(v = mean(state_geodata$crime_rate), lty=3, col = "grey80")
abline(h = mean(state_geodata$crime_rate), lty=3, col = "grey80")
Segregación en las diferentes posibilidades de agrupaciones: + No significante + HotSpots: High - High + ColdSpots: Low - Low + Atípicos: High - Low + Atípicos: Low - High
sswm_a <- queen_weights(mx_state_map) # queen spatial weight matrix (alternative format)
lisa_income <- local_moran(sswm_a, state_geodata["unemployment"])
state_geodata$cluster_unemployment <- as.factor(lisa_income$GetClusterIndicators())
levels(state_geodata$cluster_unemployment)<-lisa_income$GetLabels()
ggplot(data=state_geodata) +
geom_sf(aes(fill=cluster_unemployment)) +
ggtitle(label = "Unemployment", subtitle = "Mexico's States")
Cargamos algunas librerías adicionales que se utilizan para el ajuste de modelos de regresión:
Primero recordemos los modelos de regresión tradicional (no espaciales)
model_a <- lm(new_fdi_real_mxn ~ business_activity + unemployment, data = state_geodata)
summary(model_a)
##
## Call:
## lm(formula = new_fdi_real_mxn ~ business_activity + unemployment,
## data = state_geodata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8473 -4523 -2443 1047 26134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12176 4971 2.449 0.0206 *
## business_activity 4695 1722 2.727 0.0107 *
## unemployment 97524 124597 0.783 0.4401
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8077 on 29 degrees of freedom
## Multiple R-squared: 0.2043, Adjusted R-squared: 0.1494
## F-statistic: 3.722 on 2 and 29 DF, p-value: 0.03641
## [1] 671.4531
Aquí se busca explicar la variable de Inversión Extranjera Directa considerando las variables de Actividad Empresarial, Promedio de Ingresos mensuales reales, Tasa de Criminalidad y Densidad Poblacional.
El modelo_a, aunque significativo, realmente no es tan bueno, pues la \(R^2\) ajustada apenas sobrepasa el 0.5 (\(\bar{R^2}\) = 0.5821).
Ahora buscaremos dar una mayor complejidad a los modelos previos al considerar terminos de autocorrelación espacial, es decir
También conocido como Spatial Lag Model
model_b <- lagsarlm(new_fdi_real_mxn ~ business_activity + unemployment, data = state_geodata, listw = sswm)
summary(model_b)
##
## Call:lagsarlm(formula = new_fdi_real_mxn ~ business_activity + unemployment,
## data = state_geodata, listw = sswm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8378.93 -4533.51 -2475.15 922.67 26266.71
##
## Type: lag
## Coefficients: (numerical Hessian approximate standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11811.5 5539.3 2.1323 0.032982
## business_activity 4623.7 1740.1 2.6572 0.007879
## unemployment 97513.3 118721.1 0.8214 0.411438
##
## Rho: 0.032956, LR test value: 0.017546, p-value: 0.89462
## Approximate (numerical Hessian) standard error: 0.26295
## z-value: 0.12534, p-value: 0.90026
## Wald statistic: 0.015709, p-value: 0.90026
##
## Log likelihood: -331.7178 for lag model
## ML residual variance (sigma squared): 59069000, (sigma: 7685.6)
## Number of observations: 32
## Number of parameters estimated: 5
## AIC: 673.44, (AIC for lm: 671.45)
## [1] 673.4356
Esencial, incluir la matriz de conectividad espacial dada en el parámetro ‘listw’ por el objeto previamente calculado con el nombre de ‘sswm’
Modelo de Error Espacial
model_c <- errorsarlm(new_fdi_real_mxn ~ business_activity + unemployment, data = state_geodata, listw = sswm)
summary(model_c)
##
## Call:errorsarlm(formula = new_fdi_real_mxn ~ business_activity + unemployment,
## data = state_geodata, listw = sswm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9664.0 -4067.2 -2083.9 1860.1 25175.2
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 13904.6 4153.0 3.3481 0.0008137
## business_activity 5786.5 1386.1 4.1746 2.985e-05
## unemployment 100452.6 110678.7 0.9076 0.3640865
##
## Lambda: -0.39762, LR test value: 1.7915, p-value: 0.18074
## Asymptotic standard error: 0.23953
## z-value: -1.66, p-value: 0.096917
## Wald statistic: 2.7556, p-value: 0.096917
##
## Log likelihood: -330.8308 for error model
## ML residual variance (sigma squared): 53778000, (sigma: 7333.3)
## Number of observations: 32
## Number of parameters estimated: 5
## AIC: 671.66, (AIC for lm: 671.45)
## [1] 671.6616
Igual que en el modelo SAR es esencial, incluir la matriz de conectividad espacial dada en el parámetro ‘listw’ por el objeto previamente calculado con el nombre de ‘sswm’
model_d <- lagsarlm(new_fdi_real_mxn ~ business_activity + unemployment, data = state_geodata, listw = sswm, type="mixed")
summary(model_d)
##
## Call:lagsarlm(formula = new_fdi_real_mxn ~ business_activity + unemployment,
## data = state_geodata, listw = sswm, type = "mixed")
##
## Residuals:
## Min 1Q Median 3Q Max
## -9395.9 -4069.3 -2000.3 2234.7 25838.6
##
## Type: mixed
## Coefficients: (numerical Hessian approximate standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16582.9 10206.0 1.6248 0.104201
## business_activity 4471.6 1664.1 2.6871 0.007208
## unemployment 105464.9 110826.1 0.9516 0.341287
## lag.business_activity 4018.2 2206.8 1.8208 0.068638
## lag.unemployment 97809.4 261432.2 0.3741 0.708308
##
## Rho: -0.23256, LR test value: 0.6703, p-value: 0.41295
## Approximate (numerical Hessian) standard error: 0.28212
## z-value: -0.82434, p-value: 0.40974
## Wald statistic: 0.67954, p-value: 0.40974
##
## Log likelihood: -329.7602 for mixed model
## ML residual variance (sigma squared): 51591000, (sigma: 7182.7)
## Number of observations: 32
## Number of parameters estimated: 7
## AIC: 673.52, (AIC for lm: 672.19)
## [1] 673.5205
Igual que en los modelo SAR y SEM es esencial, incluir la matriz de conectividad espacial dada en el parámetro ‘listw’ por el objeto previamente calculado con el nombre de ‘sswm’ y aquí en particular se debe añadir el parámetro ‘type’ con valor ‘mixed’ para que en esencia mezcle los dos modelos previos (SAR y SEM).
##
## Estimated Regression Results
## =======================================================================================
## Dependent variable:
## -----------------------------------------------------------------
## new_fdi_real_mxn
## OLS spatial spatial spatial
## autoregressive error autoregressive
## (1) (2) (3) (4)
## ---------------------------------------------------------------------------------------
## business_activity 4,695.418** 4,623.750*** 5,786.464*** 4,471.587***
## (1,722.100) (1,740.075) (1,386.100) (1,664.106)
##
## unemployment 97,523.970 97,513.350 100,452.600 105,464.900
## (124,597.400) (118,721.100) (110,678.700) (110,826.100)
##
## lag.business_activity 4,018.200*
## (2,206.836)
##
## lag.unemployment 97,809.420
## (261,432.200)
##
## Constant 12,175.640** 11,811.500** 13,904.580*** 16,582.930
## (4,971.108) (5,539.305) (4,152.975) (10,206.020)
##
## ---------------------------------------------------------------------------------------
## Observations 32 32 32 32
## R2 0.204
## Adjusted R2 0.149
## Log Likelihood -331.718 -330.831 -329.760
## sigma2 59,068,556.000 53,777,605.000 51,590,758.000
## Akaike Inf. Crit. 673.436 671.662 673.520
## Residual Std. Error 8,076.685 (df = 29)
## F Statistic 3.722** (df = 2; 29)
## Wald Test (df = 1) 0.016 2.756* 0.680
## LR Test (df = 1) 0.018 1.792 0.670
## =======================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
What is exploratory data analysis? https://www.ibm.com/topics/exploratory-data-analysis
Exploratory Spatial Data Analysis https://cran.r-project.org/web/packages/geostan/vignettes/measuring-sa.html