QUIZ 4 - Presentación
Objetivo: Mediante el uso de fuentes de datos oficiales (INEGI, Banxico, y SECTUR) desarrollar y organizar una presentación de analítica espacial de datos con un enfoque en el sector turismo de los estados de México. Además de la visualización y descripción de gráficos y mapas y la estimación de modelos de regresión lineal, también se busca hacer una correcta selección e interpretación de los resultados estimados.
La Analítica Espacial de Datos (AED) es una técnica que combina métodos estadísticos con datos geográficos para identificar patrones espaciales en fenómenos sociales, económicos o ambientales. Permite analizar cómo la localización afecta los resultados observados y es clave para comprender dinámicas que varían en el espacio.
La AED permite visualizar y comprender los flujos turísticos, así como identificar “hotspots” o zonas con alto potencial de crecimiento. Esto apoya la toma de decisiones estratégicas en políticas públicas, inversión privada e infraestructura turística.
A partir de la llegada de nuevos turistas por eventos internacionales como el Mundial de Fútbol 2026, este análisis busca maximizar las oportunidades actuales y anticipar tendencias futuras en el sector turismo. El estudio se enfoca en segmentar la República en clústeres turísticos, entender el comportamiento de los visitantes en zonas clave como Cancún y evaluar el sentimiento del turista con base en datos digitales.
# Instala estos paquetes solo si no los tienes
# install.packages(c("readxl", "dplyr", "ggplot2"))
# Carga los paquetes
library(readxl)
## Warning: package 'readxl' was built under R version 4.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(stringi)
library(dplyr)
library(ggplot2)
library(readxl)
library(scales)
## Warning: package 'scales' was built under R version 4.3.2
library(sf)
## Warning: package 'sf' was built under R version 4.3.3
## Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE
library(tmap)
library(tmaptools)
## Warning: package 'tmaptools' was built under R version 4.3.2
library(spdep)
## Warning: package 'spdep' was built under R version 4.3.3
## Loading required package: spData
## Warning: package 'spData' was built under R version 4.3.3
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
library(spatialreg)
## Warning: package 'spatialreg' was built under R version 4.3.2
## Loading required package: Matrix
##
## Attaching package: 'spatialreg'
## The following objects are masked from 'package:spdep':
##
## get.ClusterOption, get.coresOption, get.mcOption,
## get.VerboseOption, get.ZeroPolicyOption, set.ClusterOption,
## set.coresOption, set.mcOption, set.VerboseOption,
## set.ZeroPolicyOption
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(GWmodel)
## Warning: package 'GWmodel' was built under R version 4.3.3
## Loading required package: robustbase
## Warning: package 'robustbase' was built under R version 4.3.3
## Loading required package: sp
## Loading required package: Rcpp
## Welcome to GWmodel version 2.3-2.
df<- read_excel("C:\\Users\\Diego Pérez\\Downloads\\mx_spatial_data\\tourism_state_data.xlsx")
## New names:
## • `region` -> `region...17`
## • `region` -> `region...18`
View(df)
head(df)
## # A tibble: 6 × 18
## state year state_id tourism_gdp crime_rate college_education unemployment
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Aguascal… 2006 1057 11995. 2.32 0.164 0.05
## 2 Baja Cal… 2006 2304 54372. 15.5 0.181 0.03
## 3 Baja Cal… 2006 2327 24238. 4.54 0.191 0.02
## 4 Campeche 2006 1086 13032. 4.19 0.154 0.01
## 5 Chiapas 2006 1182 22730. 11.6 0.0875 0.04
## 6 Chihuahua 2006 888 37096. 19.4 0.153 0.04
## # ℹ 11 more variables: employment <dbl>, business_activity <dbl>,
## # real_wage <dbl>, pop_density <dbl>, good_governance <dbl>,
## # ratio_public_investment <dbl>, exchange_rate <dbl>, inpc <dbl>,
## # border_distance <dbl>, region...17 <chr>, region...18 <dbl>
colnames(df)
## [1] "state" "year"
## [3] "state_id" "tourism_gdp"
## [5] "crime_rate" "college_education"
## [7] "unemployment" "employment"
## [9] "business_activity" "real_wage"
## [11] "pop_density" "good_governance"
## [13] "ratio_public_investment" "exchange_rate"
## [15] "inpc" "border_distance"
## [17] "region...17" "region...18"
# Renombrar explícitamente la columna correcta
df <- df %>%
rename(region = `region...17`) # nombres como "Norte", "Sur", etc.
# Verifica que el cambio se hizo
colnames(df)
## [1] "state" "year"
## [3] "state_id" "tourism_gdp"
## [5] "crime_rate" "college_education"
## [7] "unemployment" "employment"
## [9] "business_activity" "real_wage"
## [11] "pop_density" "good_governance"
## [13] "ratio_public_investment" "exchange_rate"
## [15] "inpc" "border_distance"
## [17] "region" "region...18"
# Estadísticas generales para tourism_gdp
summary(df$tourism_gdp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6240 22685 32482 56520 59014 472642
sd(df$tourism_gdp, na.rm = TRUE)
## [1] 72902.39
# Estadísticas agrupadas por región
df %>%
group_by(region) %>%
summarise(
count = n(),
mean = mean(tourism_gdp, na.rm = TRUE),
median = median(tourism_gdp, na.rm = TRUE),
sd = sd(tourism_gdp, na.rm = TRUE),
min = min(tourism_gdp, na.rm = TRUE),
max = max(tourism_gdp, na.rm = TRUE)
)
## # A tibble: 5 × 7
## region count mean median sd min max
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Bajio 85 44905. 25257. 50387. 7788. 181764.
## 2 Centro 102 114667. 42143. 136630. 6838. 472642.
## 3 Norte 102 48258. 36270. 28001. 19769. 127409.
## 4 Occidente 119 38440. 26427. 39632. 6240. 142330.
## 5 Sur 136 42186. 35320. 26179. 10338. 163029.
ggplot(df, aes(x = region, y = tourism_gdp)) +
geom_boxplot(fill = "#3399FF", color = "black", outlier.color = "red") +
labs(title = "Distribución del PIB Turístico por Región (2006-2021)",
x = "Región", y = "PIB Turístico (millones MXN)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Dispersión nacional: tourism_gdp vs. crime_rate
ggplot(df, aes(x = crime_rate, y = tourism_gdp)) +
geom_point(color = "steelblue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Relación entre PIB Turístico y Tasa de Criminalidad",
x = "Tasa de criminalidad", y = "PIB turístico (millones MXN)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# nivel educativo
ggplot(df, aes(x = college_education, y = tourism_gdp)) +
geom_point(color = "steelblue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Relación entre PIB Turístico y Educación Universitaria",
x = "Proporción con educación universitaria", y = "PIB turístico") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Dispersión con facetas por región
ggplot(df, aes(x = real_wage, y = tourism_gdp)) +
geom_point(color = "steelblue") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
facet_wrap(~region) +
labs(title = "PIB Turístico vs. Salario Real por Región",
x = "Salario real", y = "PIB turístico") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# PIB turístico nacional por año
pib_nacional <- df %>%
group_by(year) %>%
summarise(tourism_gdp_total = sum(tourism_gdp, na.rm = TRUE))
# Grafico
ggplot(pib_nacional, aes(x = year, y = tourism_gdp_total)) +
geom_line(color = "blue", size = 1) +
geom_point(color = "blue") +
labs(title = "Tendencia Nacional del PIB Turístico (2006-2021)",
x = "Año", y = "PIB Turístico Total (millones MXN)") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
pib_regional <- df %>%
group_by(region, year) %>%
summarise(tourism_gdp_total = sum(tourism_gdp, na.rm = TRUE))
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
# Grafico
ggplot(pib_regional, aes(x = year, y = tourism_gdp_total, color = region)) +
geom_line(size = 1) +
geom_point() +
labs(title = "Tendencia del PIB Turístico por Región",
x = "Año", y = "PIB Turístico Total (millones MXN)", color = "Región") +
theme_minimal()
#install.packages(c("sf", "spdep", "tmap", "tidyverse", "readxl"))
library(spdep)
# Leer el archivo .shp
library(sf)
estados_mx <-st_read("C:\\Users\\Diego Pérez\\Downloads\\mx_spatial_data\\mx_spatial_data\\mx_maps\\mx_states\\mexlatlong.shp")
## Reading layer `mexlatlong' from data source
## `C:\Users\Diego Pérez\Downloads\mx_spatial_data\mx_spatial_data\mx_maps\mx_states\mexlatlong.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 32 features and 19 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -118.4042 ymin: 14.55055 xmax: -86.73862 ymax: 32.71846
## Geodetic CRS: WGS 84
names(estados_mx)
## [1] "OBJECTID" "FIPS_ADMIN" "GMI_ADMIN" "ADMIN_NAME" "FIPS_CNTRY"
## [6] "GMI_CNTRY" "CNTRY_NAME" "POP_ADMIN" "TYPE_ENG" "TYPE_LOC"
## [11] "SQKM" "SQMI" "COLOR_MAP" "Shape_Leng" "Shape_Area"
## [16] "OBJECTID_1" "OBJECTID_2" "longitude" "latitude" "geometry"
head(estados_mx)
## Simple feature collection with 6 features and 19 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -109.4428 ymin: 22.21486 xmax: -97.1375 ymax: 31.78389
## Geodetic CRS: WGS 84
## OBJECTID FIPS_ADMIN GMI_ADMIN ADMIN_NAME FIPS_CNTRY GMI_CNTRY CNTRY_NAME
## 1 888 MX06 MEX-CHH Chihuahua MX MEX Mexico
## 2 933 MX07 MEX-CDZ Coahuila MX MEX Mexico
## 3 976 MX19 MEX-NLE Nuevo Leon MX MEX Mexico
## 4 978 MX28 MEX-TML Tamaulipas MX MEX Mexico
## 5 998 MX25 MEX-SIN Sinaloa MX MEX Mexico
## 6 1004 MX10 MEX-DRN Durango MX MEX Mexico
## POP_ADMIN TYPE_ENG TYPE_LOC SQKM SQMI COLOR_MAP Shape_Leng
## 1 2656214 State Estado 247935.02 95727.70 12 22.60928
## 2 2145539 State Estado 150843.95 58240.87 2 18.99309
## 3 3370912 State Estado 65173.05 25163.31 3 15.42617
## 4 2272724 State Estado 79502.24 30695.81 11 18.02314
## 5 2397706 State Estado 57638.85 22254.36 5 16.46605
## 6 1467826 State Estado 120674.60 46592.46 4 17.51290
## Shape_Area OBJECTID_1 OBJECTID_2 longitude latitude
## 1 22.890985 1 888 -106.44431 28.80303
## 2 13.733655 2 933 -102.03356 27.30662
## 3 5.844668 3 976 -99.83125 25.60105
## 4 7.056563 4 978 -98.62181 24.29819
## 5 5.145524 5 998 -107.48280 25.02179
## 6 10.764853 6 1004 -104.92001 24.93519
## geometry
## 1 MULTIPOLYGON (((-103.6309 2...
## 2 MULTIPOLYGON (((-102.6669 2...
## 3 MULTIPOLYGON (((-99.7139 27...
## 4 MULTIPOLYGON (((-98.61609 2...
## 5 MULTIPOLYGON (((-108.3942 2...
## 6 MULTIPOLYGON (((-104.3114 2...
# Vista rapida
plot(st_geometry(estados_mx))
# Unir shapefile con los datos
estados_datos <- left_join(estados_mx, df, by = c("ADMIN_NAME" = "state"))
# Crear matriz de vecinos y pesos
vecinos <- poly2nb(estados_datos)
pesos <- nb2listw(vecinos, style = "W")
# Mapa
library(tmap)
tmap_mode("plot")
## ℹ tmap mode set to "plot".
# Mapa 1: tourism_gdp
tm_shape(estados_datos) +
tm_polygons("tourism_gdp",
palette = "Blues",
title = "PIB Turístico") +
tm_layout(main.title = "Distribución Espacial del PIB Turístico",
legend.outside = TRUE)
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_tm_polygons()`: migrate the argument(s) related to the scale of
## the visual variable `fill` namely 'palette' (rename to 'values') to fill.scale
## = tm_scale(<HERE>).[v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'[v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`[cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Blues" is named
## "brewer.blues"Multiple palettes called "blues" found: "brewer.blues", "matplotlib.blues". The first one, "brewer.blues", is returned.
# Mapa 2: crime_rate
tm_shape(estados_datos) +
tm_polygons("crime_rate",
palette = "Blues",
title = "Índice de Criminalidad") +
tm_layout(main.title = "Distribución Espacial del Crimen",
legend.outside = TRUE)
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_tm_polygons()`: migrate the argument(s) related to the scale of
## the visual variable `fill` namely 'palette' (rename to 'values') to fill.scale
## = tm_scale(<HERE>).
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Blues" is named
## "brewer.blues"
## Multiple palettes called "blues" found: "brewer.blues", "matplotlib.blues". The first one, "brewer.blues", is returned.
# Mapa 3: good_governance
tm_shape(estados_datos) +
tm_polygons("good_governance",
palette = "Blues",
title = "Índice de Buen Gobierno") +
tm_layout(main.title = "Distribución Espacial del Buen Gobierno",
legend.outside = TRUE)
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_tm_polygons()`: migrate the argument(s) related to the scale of
## the visual variable `fill` namely 'palette' (rename to 'values') to fill.scale
## = tm_scale(<HERE>).
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Blues" is named
## "brewer.blues"
## Multiple palettes called "blues" found: "brewer.blues", "matplotlib.blues". The first one, "brewer.blues", is returned.
# Mapa 4: real_wage
tm_shape(estados_datos) +
tm_polygons("real_wage",
palette = "Blues",
title = "Salario Real") +
tm_layout(main.title = "Distribución Espacial del Salario Real",
legend.outside = TRUE)
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_tm_polygons()`: migrate the argument(s) related to the scale of
## the visual variable `fill` namely 'palette' (rename to 'values') to fill.scale
## = tm_scale(<HERE>).
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Blues" is named
## "brewer.blues"
## Multiple palettes called "blues" found: "brewer.blues", "matplotlib.blues". The first one, "brewer.blues", is returned.
library(spdep)
estados_validos <- estados_datos[!is.na(estados_datos$tourism_gdp), ]
estados_validos <- estados_datos[!is.na(estados_datos$real_wage), ]
estados_validos <- estados_datos[!is.na(estados_datos$crime_rate), ]
estados_validos <- estados_datos[!is.na(estados_datos$good_governance), ]
vecinos_validos <- poly2nb(estados_validos)
pesos_validos <- nb2listw(vecinos_validos, style = "W")
moran_gdp_turismo <- moran.test(estados_validos$tourism_gdp, pesos_validos, zero.policy = TRUE)
moran_gdp_turismo
##
## Moran I test under randomisation
##
## data: estados_validos$tourism_gdp
## weights: pesos_validos
##
## Moran I statistic standard deviate = 2.8037, p-value = 0.002526
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 1.580368e-02 -1.901141e-03 3.987752e-05
# Crimen
moran_crimen <- moran.test(estados_validos$crime_rate, pesos_validos, zero.policy = TRUE)
moran_crimen
##
## Moran I test under randomisation
##
## data: estados_validos$crime_rate
## weights: pesos_validos
##
## Moran I statistic standard deviate = 28.098, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 1.747368e-01 -1.901141e-03 3.952118e-05
# Buen gobierno
moran_gobierno <- moran.test(estados_validos$good_governance, pesos_validos, zero.policy = TRUE)
moran_gobierno
##
## Moran I test under randomisation
##
## data: estados_validos$good_governance
## weights: pesos_validos
##
## Moran I statistic standard deviate = 3.6914, p-value = 0.0001115
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 1.433605e-02 -1.901141e-03 1.934785e-05
# Salario
moran_salario <- moran.test(estados_validos$real_wage, pesos_validos, zero.policy = TRUE)
moran_salario
##
## Moran I test under randomisation
##
## data: estados_validos$real_wage
## weights: pesos_validos
##
## Moran I statistic standard deviate = 22.319, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 1.393045e-01 -1.901141e-03 4.002578e-05
# Crime
#estados_validos$local_moran_crimen <- local_moran_crimen[,1]
#estados_validos$pval_moran_crimen <- local_moran_crimen[,5]
#tm_shape(estados_validos) +
tm_polygons("local_moran_crimen",
palette = "RdBu",
title = "Local Moran I - Crime Rate") +
tm_layout(main.title = "Clusters Espaciales del Crime Rate",
legend.outside = TRUE)
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_tm_polygons()`: migrate the argument(s) related to the scale of
## the visual variable `fill` namely 'palette' (rename to 'values') to fill.scale
## = tm_scale(<HERE>).
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
## [nothing to show] no data layers defined
# Salarios
#estados_validos$local_moran_salarios <- local_moran_salarios[,1]
#estados_validos$pval_moran_salarios <- local_moran_salarios[,5]
#tm_shape(estados_validos) +
tm_polygons("local_moran_salarios",
palette = "RdBu",
title = "Local Moran I - Salarios") +
tm_layout(main.title = "Clusters Espaciales de Salarios",
legend.outside = TRUE)
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_tm_polygons()`: migrate the argument(s) related to the scale of
## the visual variable `fill` namely 'palette' (rename to 'values') to fill.scale
## = tm_scale(<HERE>).
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
## [nothing to show] no data layers defined
#LEER LOS ARCHIVOS
df_turismo<- read_excel("C:\\Users\\Diego Pérez\\Downloads\\tourism_state_data.xlsx")
## New names:
## • `region` -> `region...17`
## • `region` -> `region...18`
df_hoteles<- read_excel("C:\\Users\\Diego Pérez\\Downloads\\turismo_llegadas_extranjeros.xlsx")
# Filtrar solo datos del año 2022
df_turismo <- df_turismo %>%
filter(year == 2022)
# Renombrar columnas si es necesario
names(df_turismo)[which(names(df_turismo) == "region...17")] <- "region"
names(df_turismo)[which(names(df_turismo) == "region...18")] <- "region_numero"
# Hacer la unión sin modificar los nombres de estado
df_turismo <- df_turismo %>%
left_join(df_hoteles %>% select(estado, total_visitantes),
by = c("state" = "estado"))
#Vemos nuestra base de datos
df_turismo <- df_turismo %>%
mutate(state = ifelse(state == "estado de mexico", "mexico", state))
shapefile <- st_read("C:\\Users\\Diego Pérez\\Downloads\\mx_spatial_data\\mx_spatial_data\\mx_maps\\mx_states\\mexlatlong.shp")
## Reading layer `mexlatlong' from data source
## `C:\Users\Diego Pérez\Downloads\mx_spatial_data\mx_spatial_data\mx_maps\mx_states\mexlatlong.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 32 features and 19 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -118.4042 ymin: 14.55055 xmax: -86.73862 ymax: 32.71846
## Geodetic CRS: WGS 84
names(shapefile)
## [1] "OBJECTID" "FIPS_ADMIN" "GMI_ADMIN" "ADMIN_NAME" "FIPS_CNTRY"
## [6] "GMI_CNTRY" "CNTRY_NAME" "POP_ADMIN" "TYPE_ENG" "TYPE_LOC"
## [11] "SQKM" "SQMI" "COLOR_MAP" "Shape_Leng" "Shape_Area"
## [16] "OBJECTID_1" "OBJECTID_2" "longitude" "latitude" "geometry"
head(shapefile)
## Simple feature collection with 6 features and 19 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -109.4428 ymin: 22.21486 xmax: -97.1375 ymax: 31.78389
## Geodetic CRS: WGS 84
## OBJECTID FIPS_ADMIN GMI_ADMIN ADMIN_NAME FIPS_CNTRY GMI_CNTRY CNTRY_NAME
## 1 888 MX06 MEX-CHH Chihuahua MX MEX Mexico
## 2 933 MX07 MEX-CDZ Coahuila MX MEX Mexico
## 3 976 MX19 MEX-NLE Nuevo Leon MX MEX Mexico
## 4 978 MX28 MEX-TML Tamaulipas MX MEX Mexico
## 5 998 MX25 MEX-SIN Sinaloa MX MEX Mexico
## 6 1004 MX10 MEX-DRN Durango MX MEX Mexico
## POP_ADMIN TYPE_ENG TYPE_LOC SQKM SQMI COLOR_MAP Shape_Leng
## 1 2656214 State Estado 247935.02 95727.70 12 22.60928
## 2 2145539 State Estado 150843.95 58240.87 2 18.99309
## 3 3370912 State Estado 65173.05 25163.31 3 15.42617
## 4 2272724 State Estado 79502.24 30695.81 11 18.02314
## 5 2397706 State Estado 57638.85 22254.36 5 16.46605
## 6 1467826 State Estado 120674.60 46592.46 4 17.51290
## Shape_Area OBJECTID_1 OBJECTID_2 longitude latitude
## 1 22.890985 1 888 -106.44431 28.80303
## 2 13.733655 2 933 -102.03356 27.30662
## 3 5.844668 3 976 -99.83125 25.60105
## 4 7.056563 4 978 -98.62181 24.29819
## 5 5.145524 5 998 -107.48280 25.02179
## 6 10.764853 6 1004 -104.92001 24.93519
## geometry
## 1 MULTIPOLYGON (((-103.6309 2...
## 2 MULTIPOLYGON (((-102.6669 2...
## 3 MULTIPOLYGON (((-99.7139 27...
## 4 MULTIPOLYGON (((-98.61609 2...
## 5 MULTIPOLYGON (((-108.3942 2...
## 6 MULTIPOLYGON (((-104.3114 2...
# Primero, asegúrate que ambos sean del mismo tipo
shapefile$OBJECTID <- as.numeric(shapefile$OBJECTID)
df_turismo$state_id <- as.numeric(df_turismo$state_id)
# Join por ID
estados_datos <- left_join(shapefile, df_turismo, by = c("OBJECTID" = "state_id"))
# Obtener solo las columnas numéricas sin geometría
datos_numericos <- estados_datos %>%
st_drop_geometry() %>%
select(tourism_gdp, crime_rate, college_education, unemployment,
employment, business_activity, real_wage, pop_density,
good_governance, ratio_public_investment)
# Estandarizar con scale()
datos_estandarizados <- as.data.frame(scale(datos_numericos))
# Combinar con la geometría y otras variables importantes
estados_datos_std <- estados_datos %>%
select(total_visitantes, geometry) %>%
bind_cols(datos_estandarizados)
# Crear vecinos y lista de pesos
vecinos <- poly2nb(estados_datos_std)
pesos <- nb2listw(vecinos, style = "W", zero.policy = TRUE)
# Asegurarte de que los datos están alineados antes de combinar
stopifnot(nrow(estados_datos) == nrow(datos_estandarizados))
# Combinar datos con cbind (más seguro que bind_cols en este caso)
estados_datos_std <- estados_datos %>%
select(total_visitantes, geometry)
estados_datos_std <- st_as_sf(cbind(estados_datos_std, datos_estandarizados))
# Crear vecinos tipo Queen
vecinos_queen <- poly2nb(estados_datos, queen = TRUE)
# Obtener centroides para trazar líneas
coords <- st_coordinates(st_centroid(estados_datos))
## Warning: st_centroid assumes attributes are constant over geometries
# Mapa de conectividad Queen
plot(st_geometry(estados_datos), border = "gray", main = "Conectividad tipo Queen")
plot(vecinos_queen, coords, add = TRUE, col = "darkgreen", lwd = 1)
# Crear vecinos tipo Rook
vecinos_rook <- poly2nb(estados_datos, queen = FALSE)
# Mapa de conectividad Rook
plot(st_geometry(estados_datos), border = "gray", main = "Conectividad tipo Rook")
plot(vecinos_rook, coords, add = TRUE, col = "purple", lwd = 1)
#Creación de formula espacial
formula_espacial <- total_visitantes ~ tourism_gdp + crime_rate + college_education +
unemployment + business_activity + real_wage +pop_density
# Seleccionar datos de entrenamiento
set.seed(123) # para reproducibilidad
# Variables predictoras
X_ml <- estados_datos_std %>%
st_drop_geometry() %>%
select(tourism_gdp, crime_rate, college_education, unemployment,
employment, business_activity, real_wage, pop_density)
# Variable dependiente
y_ml <- estados_datos_std$total_visitantes
# Entrenar el modelo Random Forest
modelo_rf <- randomForest(x = X_ml, y = y_ml, ntree = 500, importance = TRUE)
# Ver resultados
print(modelo_rf)
##
## Call:
## randomForest(x = X_ml, y = y_ml, ntree = 500, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 4.613301e+12
## % Var explained: 4.31
# SAR
modelo_sar <- lagsarlm(formula_espacial, data = estados_datos_std, listw = pesos, zero.policy = TRUE)
## Warning in lagsarlm(formula_espacial, data = estados_datos_std, listw = pesos, : inversion of asymptotic covariance matrix failed for tol.solve = 2.22044604925031e-16
## número de condición recíproco = 7.0604e-26 - using numerical Hessian.
summary(modelo_sar)
##
## Call:lagsarlm(formula = formula_espacial, data = estados_datos_std,
## listw = pesos, zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3455217 -754653 15407 521541 6765740
##
## Type: lag
## Coefficients: (numerical Hessian approximate standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 921469 325062 2.8347 0.004586
## tourism_gdp 1952871 496130 3.9362 8.278e-05
## crime_rate -66906 310128 -0.2157 0.829192
## college_education 1139387 380460 2.9948 0.002747
## unemployment -277972 407692 -0.6818 0.495353
## business_activity -125034 329749 -0.3792 0.704553
## real_wage -374904 409213 -0.9162 0.359583
## pop_density -1666374 543558 -3.0657 0.002172
##
## Rho: -0.063854, LR test value: 0.093976, p-value: 0.75918
## Approximate (numerical Hessian) standard error: 0.20431
## z-value: -0.31254, p-value: 0.75463
## Wald statistic: 0.097684, p-value: 0.75463
##
## Log likelihood: -502.6744 for lag model
## ML residual variance (sigma squared): 2.5786e+12, (sigma: 1605800)
## Number of observations: 32
## Number of parameters estimated: 10
## AIC: 1025.3, (AIC for lm: 1023.4)
# SEM
model_c <- errorsarlm(formula_espacial, data = estados_datos_std, listw = pesos, zero.policy = TRUE)
## Warning in errorsarlm(formula_espacial, data = estados_datos_std, listw = pesos, : inversion of asymptotic covariance matrix failed for tol.solve = 2.22044604925031e-16
## número de condición recíproco = 1.47547e-25 - using numerical Hessian.
summary(model_c)
##
## Call:errorsarlm(formula = formula_espacial, data = estados_datos_std,
## listw = pesos, zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3327371.7 -649821.4 7217.7 513518.6 6604846.2
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 843002 222422 3.7901 0.0001506
## tourism_gdp 2036179 511676 3.9794 6.908e-05
## crime_rate -243820 306722 -0.7949 0.4266585
## college_education 1147430 350640 3.2724 0.0010664
## unemployment -383788 380856 -1.0077 0.3135986
## business_activity -143515 344419 -0.4167 0.6769079
## real_wage -245679 397164 -0.6186 0.5361910
## pop_density -1864200 544714 -3.4223 0.0006208
##
## Lambda: -0.25999, LR test value: 0.52395, p-value: 0.46916
## Approximate (numerical Hessian) standard error: 0.37711
## z-value: -0.68942, p-value: 0.49056
## Wald statistic: 0.4753, p-value: 0.49056
##
## Log likelihood: -502.4595 for error model
## ML residual variance (sigma squared): 2.5049e+12, (sigma: 1582700)
## Number of observations: 32
## Number of parameters estimated: 10
## AIC: 1024.9, (AIC for lm: 1023.4)
# SDM
model_d <- lagsarlm(formula_espacial, data = estados_datos_std, listw = pesos, zero.policy = TRUE, type="mixed")
## Warning in lagsarlm(formula_espacial, data = estados_datos_std, listw = pesos, : inversion of asymptotic covariance matrix failed for tol.solve = 2.22044604925031e-16
## número de condición recíproco = 1.30412e-25 - using numerical Hessian.
summary(model_d)
##
## Call:lagsarlm(formula = formula_espacial, data = estados_datos_std,
## listw = pesos, type = "mixed", zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1957395 -666180 -92008 731111 3644042
##
## Type: mixed
## Coefficients: (numerical Hessian approximate standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1495064 317330 4.7114 2.46e-06
## tourism_gdp 1374930 504781 2.7238 0.0064532
## crime_rate -482576 274910 -1.7554 0.0791921
## college_education 409909 407063 1.0070 0.3139394
## unemployment -542126 344241 -1.5748 0.1152917
## business_activity -551471 309103 -1.7841 0.0744068
## real_wage 367577 386761 0.9504 0.3419099
## pop_density -1168390 544294 -2.1466 0.0318239
## lag.tourism_gdp 162513 1323184 0.1228 0.9022501
## lag.crime_rate -1981441 520527 -3.8066 0.0001409
## lag.college_education 2578928 971675 2.6541 0.0079519
## lag.unemployment -3324327 1104626 -3.0095 0.0026171
## lag.business_activity -732334 484433 -1.5117 0.1306018
## lag.real_wage 1907740 771552 2.4726 0.0134134
## lag.pop_density -1430235 1401019 -1.0209 0.3073239
##
## Rho: -0.66286, LR test value: 6.2771, p-value: 0.012231
## Approximate (numerical Hessian) standard error: 0.23625
## z-value: -2.8058, p-value: 0.0050191
## Wald statistic: 7.8725, p-value: 0.0050191
##
## Log likelihood: -493.5157 for mixed model
## ML residual variance (sigma squared): 1.3044e+12, (sigma: 1142100)
## Number of observations: 32
## Number of parameters estimated: 17
## AIC: 1021, (AIC for lm: 1025.3)
# Calcular RMSE
rmse <- function(y_real, y_pred) {
sqrt(mean((y_real - y_pred)^2))
}
# SAR
pred_sar <- modelo_sar$fit
rmse_sar <- rmse(estados_datos_std$total_visitantes, pred_sar)
# SEM
pred_sem <- model_c$fit
rmse_sem <- rmse(estados_datos_std$total_visitantes, pred_sem)
# SDM
pred_sdm <- model_d$fit
rmse_sdm <- rmse(estados_datos_std$total_visitantes, pred_sdm)
cat("RMSE Comparativo:\n")
## RMSE Comparativo:
cat("SAR: ", rmse_sar, "\n")
## SAR: 1605793
cat("SEM: ", rmse_sem, "\n")
## SEM: 1582681
cat("SDM: ", rmse_sdm, "\n")
## SDM: 1142111
# Convertir a objeto Spatial si es sf
estados_sp <- as(estados_datos_std, "Spatial")
# Definir fórmula
formula_espacial1 <- total_visitantes ~ tourism_gdp + crime_rate + college_education +
unemployment + business_activity + real_wage + pop_density
# Calcular bandwidth óptimo y ajustar modelo GWR
modelo_gwr <- gwr.basic(
formula = formula_espacial1,
data = estados_sp,
bw = bw.gwr(formula_espacial1, data = estados_sp, approach = "AICc",
kernel = "bisquare", adaptive = TRUE),
kernel = "bisquare",
adaptive = TRUE
)
## Adaptive bandwidth (number of nearest neighbours): 27 AICc value: 1027.464
## Adaptive bandwidth (number of nearest neighbours): 25 AICc value: 1035.885
## Adaptive bandwidth (number of nearest neighbours): 30 AICc value: 1025.743
## Adaptive bandwidth (number of nearest neighbours): 30 AICc value: 1025.743
# Ver resumen
summary(modelo_gwr)
## Length Class Mode
## GW.arguments 11 -none- list
## GW.diagnostic 8 -none- list
## lm 14 lm list
## SDF 32 SpatialPolygonsDataFrame S4
## timings 5 -none- list
## this.call 6 -none- call
## Ftests 0 -none- list
# Crear modelo OLS antes de los diagnósticos
modelo_ols <- lm(tourism_gdp ~ crime_rate + real_wage + college_education, data = estados_datos)
summary(modelo_ols)
##
## Call:
## lm(formula = tourism_gdp ~ crime_rate + real_wage + college_education,
## data = estados_datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -84553 -40657 -19855 28899 174486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -225254.3 92721.4 -2.429 0.0218 *
## crime_rate -136.4 413.3 -0.330 0.7438
## real_wage 497.0 288.7 1.721 0.0962 .
## college_education 407753.2 240585.6 1.695 0.1012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63220 on 28 degrees of freedom
## Multiple R-squared: 0.2935, Adjusted R-squared: 0.2178
## F-statistic: 3.877 on 3 and 28 DF, p-value: 0.0195
library(car) # Para VIF
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(lmtest) # Para heterocedasticidad
## Warning: package 'lmtest' was built under R version 4.3.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.1
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(spdep)
# Prueba de multicolinealidad
vif(modelo_ols)
## crime_rate real_wage college_education
## 1.005809 1.311691 1.308753
# Prueba de heterocedasticidad (Breusch-Pagan)
bptest(modelo_ols)
##
## studentized Breusch-Pagan test
##
## data: modelo_ols
## BP = 11.384, df = 3, p-value = 0.009819
modelo_ols <- lm(tourism_gdp ~ crime_rate + real_wage + college_education, data = estados_validos)
residuos_ols <- residuals(modelo_ols)
moran_residuos_ols <- moran.test(residuos_ols, pesos_validos, zero.policy = TRUE)
moran_residuos_ols
##
## Moran I test under randomisation
##
## data: residuos_ols
## weights: pesos_validos
##
## Moran I statistic standard deviate = -0.010453, p-value = 0.5042
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## -1.967141e-03 -1.901141e-03 3.986416e-05
# AIC de los modelos
aic_ols <- AIC(modelo_ols)
aic_sar <- AIC(modelo_sar)
aic_sem <- AIC(model_c)
aic_sdm <- AIC(model_d)
# Imprimir AIC
data.frame(
Modelo = c("OLS", "SAR", "SEM", "SDM"),
AIC = c(aic_ols, aic_sar, aic_sem, aic_sdm)
)
## Modelo AIC
## 1 OLS 12664.974
## 2 SAR 1025.349
## 3 SEM 1024.919
## 4 SDM 1021.031
# Función RMSE
rmse <- function(model) {
sqrt(mean(residuals(model)^2))
}
# Calcular RMSE
rmse_ols <- rmse(modelo_ols)
rmse_sar <- rmse(modelo_sar)
rmse_sem <- rmse(model_c)
rmse_sdm <- rmse(model_d)
# Imprimir RMSE
data.frame(
Modelo = c("OLS", "SAR", "SEM", "SDM"),
RMSE = c(rmse_ols, rmse_sar, rmse_sem, rmse_sdm)
)
## Modelo RMSE
## 1 OLS 39643.33
## 2 SAR 1605792.81
## 3 SEM 1582681.29
## 4 SDM 1142110.86
# Predicción de GWR
predicciones_gwr <- modelo_gwr$SDF$yhat
estados_datos_std$pred_gwr <- predicciones_gwr
estados_datos_std$resid_gwr <- estados_datos_std$total_visitantes - predicciones_gwr
# Mapa de residuos
tm_shape(estados_datos_std) +
tm_polygons("resid_gwr", palette = "-RdBu", style = "quantile",
title = "Residuales GWR") +
tm_layout(main.title = "Errores de predicción - Modelo GWR",
legend.outside = TRUE)
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'palette' (rename to 'values') to
## 'tm_scale_intervals(<HERE>)'
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
## Multiple palettes called "rd_bu" found: "brewer.rd_bu", "matplotlib.rd_bu". The first one, "brewer.rd_bu", is returned.
##
## [scale] tm_polygons:() the data variable assigned to 'fill' contains positive and negative values, so midpoint is set to 0. Set 'midpoint = NA' in 'fill.scale = tm_scale_intervals(<HERE>)' to use all visual values (e.g. colors)
##
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "-RdBu" is named
## "rd_bu" (in long format "brewer.rd_bu")
#install.packages(c("plm", "spdep", "splm"))
library(plm)
## Warning: package 'plm' was built under R version 4.3.3
##
## Attaching package: 'plm'
## The following objects are masked from 'package:dplyr':
##
## between, lag, lead
library(splm)
## Warning: package 'splm' was built under R version 4.3.3
# Convertir la base a panel (índice: estado y año)
df_panel <- pdata.frame(df, index = c("state", "year"))
# Modelo con efectos fijos
modelo_fe <- plm(tourism_gdp ~ crime_rate + real_wage + college_education, data = df_panel, model = "within")
summary(modelo_fe)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = tourism_gdp ~ crime_rate + real_wage + college_education,
## data = df_panel, model = "within")
##
## Balanced Panel: n = 32, T = 17, N = 544
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -130264.47 -3078.67 -31.66 2885.45 80249.22
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## crime_rate 31.564 38.165 0.8270 0.4086
## real_wage 8.584 51.270 0.1674 0.8671
## college_education -1572.388 22083.643 -0.0712 0.9433
##
## Total Sum of Squares: 9.186e+10
## Residual Sum of Squares: 9.1701e+10
## R-Squared: 0.0017365
## Adj. R-Squared: -0.064945
## F-statistic: 0.295134 on 3 and 509 DF, p-value: 0.82893
# Modelo con efectos aleatorios
modelo_re <- plm(tourism_gdp ~ crime_rate + real_wage + college_education, data = df_panel, model = "random")
summary(modelo_re)
## Oneway (individual) effect Random Effect Model
## (Swamy-Arora's transformation)
##
## Call:
## plm(formula = tourism_gdp ~ crime_rate + real_wage + college_education,
## data = df_panel, model = "random")
##
## Balanced Panel: n = 32, T = 17, N = 544
##
## Effects:
## var std.dev share
## idiosyncratic 1.802e+08 1.342e+04 0.048
## individual 3.596e+09 5.997e+04 0.952
## theta: 0.9458
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -112412.0 -4327.9 -1414.0 1836.5 98018.3
##
## Coefficients:
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept) 46356.173 16786.023 2.7616 0.005752 **
## crime_rate 26.761 38.590 0.6935 0.488003
## real_wage 35.381 51.136 0.6919 0.488999
## college_education -7439.525 22179.697 -0.3354 0.737308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 1.0007e+11
## Residual Sum of Squares: 9.983e+10
## R-Squared: 0.0024021
## Adj. R-Squared: -0.0031402
## Chisq: 1.30023 on 3 DF, p-value: 0.72908
# Prueba de Hausman
hausman_test <- phtest(modelo_fe, modelo_re)
hausman_test
##
## Hausman Test
##
## data: tourism_gdp ~ crime_rate + real_wage + college_education
## chisq = 50.737, df = 3, p-value = 5.566e-11
## alternative hypothesis: one model is inconsistent
# Agregar predicciones del modelo elegido (por ejemplo, efectos fijos)
df$pred_panel <- predict(modelo_fe)
# Gráfico: valores observados vs predichos (promediado por año)
library(ggplot2)
df_panel_sum <- df %>%
group_by(year) %>%
summarise(observado = mean(tourism_gdp, na.rm = TRUE),
predicho = mean(pred_panel, na.rm = TRUE))
ggplot(df_panel_sum, aes(x = year)) +
geom_line(aes(y = observado, color = "Observado")) +
geom_line(aes(y = predicho, color = "Predicho")) +
labs(title = "Comparación PIB Turístico Observado vs. Predicho (Modelo Panel)",
y = "PIB Turístico promedio", color = "") +
theme_minimal()
Se estimaron modelos de datos panel con efectos fijos (FE) y aleatorios (RE) para analizar el impacto de la criminalidad, salario real y nivel educativo sobre el PIB turístico estatal.
El modelo de efectos fijos no mostró variables significativas y tuvo bajo poder explicativo (R² ajustado negativo), pero permite controlar diferencias no observadas entre estados.
El modelo de efectos aleatorios también presentó bajo ajuste y sin significancia estadística.
La prueba de Hausman fue significativa (p < 0.001), indicando que el modelo de efectos fijos es más apropiado.
Los modelos de datos panel se diferencian de los de corte transversal porque permiten observar múltiples unidades a lo largo del tiempo, lo que mejora la capacidad para detectar dinámicas y controlar efectos no observables. Esto aumenta la precisión de las estimaciones y reduce el sesgo. Además, los modelos panel pueden manejar estructuras complejas, como efectos fijos o aleatorios, que los modelos de corte transversal no capturan.
La autocorrelación en datos panel se puede identificar de dos formas: temporal y espacial. La autocorrelación temporal se detecta comúnmente con la prueba de Wooldridge, que evalúa dependencia serial dentro de las unidades. La autocorrelación espacial se evalúa con el índice de Moran aplicado a los residuos del modelo, lo que permite identificar si hay dependencia espacial entre las observaciones después de ajustar por covariables.
El modelo espacial de datos panel suele ofrecer una mejor predicción que el modelo espacial de corte transversal, ya que considera tanto las dependencias espaciales como las variaciones a lo largo del tiempo. Esto permite capturar efectos persistentes entre unidades geográficas y mejora la robustez del modelo. Aunque requiere más información y procesamiento, su capacidad explicativa y predictiva es superior en contextos como el análisis turístico regional.