El propósito de esta actividad es poner en práctica lo estudiado hasta ahora en los temas de Análisis Exploratorio de Datos Espaciales, así como Modelos de Regresión Lineal Espaciales.
Primeramente aquí se proporcionan las librerías principales que deberás considerar. De todas maneras si deseas agregar cualquier otra librería lo puedes hacer y recomendaría hacerlo de una vez dentro del siguiente chunk de código:
#install.packages("dlookr")
library(foreign)
library(dplyr)
library(spdep)
library(tigris)
library(tidyr)
library(rgeoda)
library(RColorBrewer)
library(viridis)
library(ggplot2)
library(tmap)
library(sf)
library(sp)
library(spatialreg)
library(stargazer)
library(plm)
library(splm)
library(pspatreg)
library(regclass)
library(mctest)
library(lmtest)
library(spData)
library(mapview)
library(naniar)
library(dlookr)
library(caret)
library(e1071)
library(SparseM)
library(Metrics)
library(randomForest)
library(rpart.plot)
library(insight)
library(jtools)
library(xgboost)
library(DiagrammeR)
library(effects)
library(leaflet)
library(GGally)
A continuación se carga la base de datos a ser analizada: ‘boston’ de la librería ‘spData’
data(boston, package="spData")
head(data)
##
## 1 function (..., list = character(), package = NULL, lib.loc = NULL,
## 2 verbose = getOption("verbose"), envir = .GlobalEnv, overwrite = TRUE)
## 3 {
## 4 fileExt <- function(x) {
## 5 db <- grepl("\\\\.[^.]+\\\\.(gz|bz2|xz)$", x)
## 6 ans <- sub(".*\\\\.", "", x)
glimpse(data())
## List of 4
## $ title : chr "Data sets"
## $ header : NULL
## $ results: chr [1:541, 1:4] "GGally" "GGally" "GGally" "GGally" ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:4] "Package" "LibPath" "Item" "Title"
## $ footer : chr "Use 'data(package = .packages(all.available = TRUE))'\nto list the data sets in all *available* packages."
## - attr(*, "class")= chr "packageIQR"
Seguidamente, se cargan las geocercas asociadas a la data transversal previamente cargada
boston.tr <-st_read(system.file("shapes/boston_tracts.gpkg", package="spData")[1])
## Reading layer `boston_tracts' from data source
## `C:\Users\mari0\AppData\Local\R\win-library\4.4\spData\shapes\boston_tracts.gpkg'
## using driver `GPKG'
## Simple feature collection with 506 features and 36 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -71.52311 ymin: 42.00305 xmax: -70.63823 ymax: 42.67307
## Geodetic CRS: NAD27
Ejemplo de primer visualización en Mapas utilizando los datos ya cargados y visualizando los valores utilizando una variedad de paletas de colores.
boston.trSP<-as(boston.tr, "Spatial")
boston_nb<-poly2nb(boston.trSP)
mapview(boston.trSP, zcol="CMEDV", col.regions = colorRampPalette(c('lightgrey', 'darkblue')))
mapview(boston.trSP, zcol="CMEDV", col.regions = brewer.pal(7,'Blues'))
## Warning: Found less unique colors (7) than unique zcol values (228)!
## Interpolating color vector to match number of zcol values.
mapview(boston.trSP, zcol="CMEDV", col.regions = viridisLite::magma(20))
## Warning: Found less unique colors (20) than unique zcol values (228)!
## Interpolating color vector to match number of zcol values.
La meta esencial del análisis que llevarás a cabo será explicar la variable ‘CRIM’ o ‘CMEDV’ de acuerdo a la paridad del número de su equipo: - Equipos Impares 1, 3, 5 y 7 explicarán ‘CRIM’ - Equipos Pares 2, 4, 6 explicarán ‘CMEDV’ (en este caso deberán de omitir la variable ‘MEDV’ de la base de datos)
Realiza un breve análisis exploratorio de datos tradicional. Cuestiones que se deberán de incluir en esta sección incluyen: - Histogramas - Matriz de Correlaciones - Boxplots
boston_data <- boston.tr %>% st_drop_geometry()
numeric_vars <- boston_data %>%
select(where(is.numeric)) %>%
select(-MEDV)
long_data <- pivot_longer(numeric_vars, everything(), names_to = "Variable", values_to = "Valor")
ggplot(long_data, aes(x = Valor)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
facet_wrap(~ Variable, scales = "free", ncol = 4) +
theme_minimal() +
labs(title = "Histogramas de Variables Numéricas")
## Warning: Removed 17 rows containing non-finite outside the scale range
## (`stat_bin()`).
subset_vars <- numeric_vars %>%
select(CMEDV, RM, NOX, DIS, PTRATIO, LSTAT)
ggpairs(subset_vars) +
theme_minimal() +
ggtitle("Matriz de Correlaciones")
library(corrplot)
## corrplot 0.92 loaded
# Calcular matriz de correlación
cor_matrix <- cor(numeric_vars)
top_corrs <- as.data.frame(as.table(cor_matrix)) %>%
filter(Var1 != Var2) %>%
mutate(abs_corr = abs(Freq)) %>%
arrange(desc(abs_corr)) %>%
distinct(abs_corr, .keep_all = TRUE) %>%
slice_head(n = 10)
top_vars <- unique(c(top_corrs$Var1, top_corrs$Var2))
top10_data <- numeric_vars[, top_vars]
top10_corr_matrix <- cor(top10_data)
corrplot(top10_corr_matrix, method = "color", type = "upper",
tl.col = "black", tl.srt = 45, title = "Top 10 Correlaciones", mar = c(0,0,2,0))
boston.tr$CMEDV <- as.factor(boston.tr$CMEDV)
# Boxplots para algunas variables contra CMEDV
ggplot(boston.tr, aes(x = CMEDV, y = RM, fill = CMEDV)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Boxplot de RM (habitaciones) según CMEDV", x = "Categoría de Valor Medio (CMEDV)", y = "Número de Habitaciones")
ggplot(boston.tr, aes(x = CMEDV, y = LSTAT, fill = CMEDV)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Boxplot de LSTAT (% población de bajos recursos) según CMEDV", x = "Categoría de Valor Medio (CMEDV)", y = "% LSTAT")
ggplot(boston.tr, aes(y = LSTAT)) +
geom_boxplot(fill = "skyblue") +
theme_minimal() +
labs(title = "Boxplot de LSTAT",
y = "RM")
ggplot(boston.tr, aes(y = RM)) +
geom_boxplot(fill = "skyblue") +
theme_minimal() +
labs(title = "Boxplot de RM (promedio de habitaciones)",
y = "RM")
mapview(boston.trSP, zcol="NOX", col.regions = colorRampPalette(c('lightgrey', 'darkgreen')))
mapview(boston.trSP, zcol="LSTAT", col.regions = colorRampPalette(c('lightgrey', 'darkorange')))
Realiza un análisis exploratorio de datos espaciales. Los puntos esenciales a generar en esta sección son los siguientes: - Definir claramente la matriz de conectividad espacial - Graficar dicha matriz, utilizar la estructura ‘Reina’ para determinar los vecinos - Realizar mapa coloreado de acuerdo al valor de la variable que se desea explicar y al menos 2 de las variables independientes que se utilizarán (mapa de 5 variables: high-high, low-low, high-low, low-high) - Realizar análisis de clusters: + Incluir mapa de un lag espacial de la variable que se desea explicar junto con el mapa original previamente realizado + Visualización Espacial de Clusters (HotSpots, ColdSpots, NonSignificant y Atípicos: HighLows, LowHighs) - Índice Global de Moran + Calcular el Estadístico del Índice Global de Moran para las variables previamente graficadas (al menos son 3: 1 dependiente, 2 independientes) + Generar sus respectivos gráficos de dispersión
swm <- poly2nb(boston.tr, queen=T)
summary(swm)
## Neighbour list object:
## Number of regions: 506
## Number of nonzero links: 2910
## Percentage nonzero weights: 1.136559
## Average number of links: 5.750988
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 12 15
## 3 9 28 81 107 120 87 40 22 5 2 1 1
## 3 least connected regions:
## 18 51 345 with 1 link
## 1 most connected region:
## 112 with 15 links
sswm <- nb2listw(swm, style="W", zero.policy = TRUE)
boston_a <- as(boston.tr, "Spatial")
boston_centroid <- coordinates(boston_a)
plot(boston_a,border="blue",axes=FALSE,las=1, main="Boston's States Queen SWM")
plot(boston_a,col="grey",border=grey(0.9),axes=T,add=T)
plot(sswm,coords=boston_centroid,pch=19,cex=0.1,col="red",add=T)
# Clasificación cuantil para categorizar valores
boston.tr$CMEDV_num <- as.numeric(as.character(boston.tr$CMEDV))
boston.tr$CMEDV_cat <- cut(boston.tr$CMEDV_num,
breaks = quantile(boston.tr$CMEDV_num, probs = seq(0, 1, 0.25), na.rm = TRUE),
include.lowest = TRUE)
boston.tr$NOX_cat <- cut(boston.tr$NOX, breaks = quantile(boston.tr$NOX, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
boston.tr$RM_cat <- cut(boston.tr$RM, breaks = quantile(boston.tr$RM, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
# Mapas individuales
tm1 <- tm_shape(boston.tr) +
tm_fill("CMEDV_cat", palette = "YlOrRd", title = "CMEDV (Target)") +
tm_borders()
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_tm_fill()`: migrate the argument(s) related to the scale of the
## visual variable `fill` namely 'palette' (rename to 'values') to fill.scale =
## tm_scale(<HERE>).
## [v3->v4] `tm_fill()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
tm2 <- tm_shape(boston.tr) +
tm_fill("NOX_cat", palette = "Purples", title = "NOX (Independent)") +
tm_borders()
## [v3->v4] `tm_fill()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
tm3 <- tm_shape(boston.tr) +
tm_fill("RM_cat", palette = "Blues", title = "RM (Independent)") +
tm_borders()
## [v3->v4] `tm_fill()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
# Mostrar mapas en conjunto
tmap_arrange(tm1, tm2, tm3)
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "YlOrRd" is named
## "brewer.yl_or_rd"
## Multiple palettes called "yl_or_rd" found: "brewer.yl_or_rd", "matplotlib.yl_or_rd". The first one, "brewer.yl_or_rd", is returned.
##
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Purples" is named
## "brewer.purples"
## Multiple palettes called "purples" found: "brewer.purples", "matplotlib.purples". The first one, "brewer.purples", is returned.
##
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Blues" is named
## "brewer.blues"
## Multiple palettes called "blues" found: "brewer.blues", "matplotlib.blues". The first one, "brewer.blues", is returned.
# Crear matriz de vecinos
neighbors <- poly2nb(boston.tr)
sswm <- nb2listw(neighbors, style = "W", zero.policy = TRUE)
boston.tr$CMEDV <- as.numeric(as.character(boston.tr$CMEDV))
# Lag espacial para CMEDV
boston.tr$lag_CMEDV <- lag.listw(sswm, boston.tr$CMEDV, zero.policy = TRUE)
boston.tr$lag_NOX <- lag.listw(sswm, boston.tr$NOX, zero.policy = TRUE)
boston.tr$lag_RM <- lag.listw(sswm, boston.tr$RM, zero.policy = TRUE)
# Mapa original + mapa de lag espacial
tm_original <- tm_shape(boston.tr) +
tm_fill("CMEDV", palette = "YlOrRd", title = "CMEDV Original") +
tm_borders()
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_tm_fill()`: migrate the argument(s) related to the scale of the
## visual variable `fill` namely 'palette' (rename to 'values') to fill.scale =
## tm_scale(<HERE>).
## [v3->v4] `tm_fill()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
tm_lag <- tm_shape(boston.tr) +
tm_fill("lag_CMEDV", palette = "Oranges", title = "Lag Espacial CMEDV") +
tm_borders()
## [v3->v4] `tm_fill()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
tmap_arrange(tm_original, tm_lag)
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "YlOrRd" is named
## "brewer.yl_or_rd"
## Multiple palettes called "yl_or_rd" found: "brewer.yl_or_rd", "matplotlib.yl_or_rd". The first one, "brewer.yl_or_rd", is returned.
##
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Oranges" is named
## "brewer.oranges"
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.
# LISA para CMEDV
neighbors_geo <- queen_weights(boston.tr)
lisa_result <- local_moran(neighbors_geo, data.frame(CMEDV = boston.tr$CMEDV))
#boston.tr$lisa_cluster <- as.factor(lisa_clusters(lisa_result))
boston.tr$cluster_cmedv <- as.factor(lisa_result$GetClusterIndicators())
levels(boston.tr$cluster_cmedv) <- lisa_result$GetLabels()
# Mapa de clusters
ggplot(data = boston.tr) +
geom_sf(aes(fill = cluster_cmedv)) +
scale_fill_manual(values = c("lightgrey", "#9F9FED", "#CAD593", "#2D936C", "skyblue")) +
theme_minimal() +
labs(
title = "Cluster LISA para CMEDV",
subtitle = "Análisis Espacial Local en Boston",
fill = "Tipo de Cluster"
)
# Cálculo del índice para CMEDV, NOX y RM
boston.tr$CMEDV <- as.numeric(as.character(boston.tr$CMEDV))
moran_cmedv <- moran.test(boston.tr$CMEDV, sswm)
moran_nox <- moran.test(boston.tr$NOX, sswm)
moran_rm <- moran.test(boston.tr$RM, sswm)
# Mostrar resultados
print(moran_cmedv)
##
## Moran I test under randomisation
##
## data: boston.tr$CMEDV
## weights: sswm
##
## Moran I statistic standard deviate = 23.558, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.6322686784 -0.0019801980 0.0007248376
print(moran_nox)
##
## Moran I test under randomisation
##
## data: boston.tr$NOX
## weights: sswm
##
## Moran I statistic standard deviate = 33.692, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.9065225426 -0.0019801980 0.0007271085
print(moran_rm)
##
## Moran I test under randomisation
##
## data: boston.tr$RM
## weights: sswm
##
## Moran I statistic standard deviate = 17.094, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.4580618698 -0.0019801980 0.0007242995
# Gráficos de dispersión para Moran (espacial vs variable original)
m_cmedv <- lm(boston.tr$lag_CMEDV ~ boston.tr$CMEDV)
plot(boston.tr$lag_CMEDV ~ boston.tr$CMEDV,
pch = 21, asp = 1, las = 1, col = "grey40", bg = "grey80",
main = "CMEDV - Índice de Moran",
xlab = "CMEDV", ylab = "Lag espacial de CMEDV")
abline(m_cmedv, col = "blue")
abline(v = mean(boston.tr$CMEDV, na.rm = TRUE), lty = 3, col = "grey60")
abline(h = mean(boston.tr$lag_CMEDV, na.rm = TRUE), lty = 3, col = "grey60")
# Repetir para NOX
m_nox <- lm(boston.tr$lag_NOX ~ boston.tr$NOX)
plot(boston.tr$lag_NOX ~ boston.tr$NOX,
pch = 21, asp = 1, las = 1, col = "grey40", bg = "grey80",
main = "NOX - Índice de Moran",
xlab = "NOX", ylab = "Lag espacial de NOX")
abline(m_nox, col = "blue")
abline(v = mean(boston.tr$NOX, na.rm = TRUE), lty = 3, col = "grey60")
abline(h = mean(boston.tr$lag_NOX, na.rm = TRUE), lty = 3, col = "grey60")
# RM
m_rm <- lm(boston.tr$lag_RM ~ boston.tr$RM)
plot(boston.tr$lag_RM ~ boston.tr$RM,
pch = 21, asp = 1, las = 1, col = "grey40", bg = "grey80",
main = "RM - Índice de Moran",
xlab = "RM", ylab = "Lag espacial de RM")
abline(m_rm, col = "blue")
abline(v = mean(boston.tr$RM, na.rm = TRUE), lty = 3, col = "grey60")
abline(h = mean(boston.tr$lag_RM, na.rm = TRUE), lty = 3, col = "grey60")
Por último deberás de ajustar los Modelos de Regresión estudiados en este Módulo: - Modelo de Regresión Lineal Tradicional - Modelo de Regresión Espacial AutoRegresivo (SAR) - Modelo de Regresión Espacial de Errores (SEM) - Modelo de Regresión Espacial Durbin
# Prepara los datos sin la variable MEDV
boston_data <- boston.tr %>% st_drop_geometry() %>% select(-MEDV)
summary(boston_data)
## poltract TOWN TOWNNO TRACT
## Length:506 Length:506 Min. : 0.00 Min. : 1
## Class :character Class :character 1st Qu.:26.25 1st Qu.:1303
## Mode :character Mode :character Median :42.00 Median :3394
## Mean :47.53 Mean :2700
## 3rd Qu.:78.00 3rd Qu.:3740
## Max. :91.00 Max. :5082
##
## LON LAT CMEDV CRIM
## Min. :-71.29 Min. :42.03 Min. : 5.00 Min. : 0.00632
## 1st Qu.:-71.09 1st Qu.:42.18 1st Qu.:17.02 1st Qu.: 0.08205
## Median :-71.05 Median :42.22 Median :21.20 Median : 0.25651
## Mean :-71.06 Mean :42.22 Mean :22.53 Mean : 3.61352
## 3rd Qu.:-71.02 3rd Qu.:42.25 3rd Qu.:25.00 3rd Qu.: 3.67708
## Max. :-70.81 Max. :42.38 Max. :50.00 Max. :88.97620
##
## ZN INDUS CHAS NOX
## Min. : 0.00 Min. : 0.46 Length:506 Min. :0.3850
## 1st Qu.: 0.00 1st Qu.: 5.19 Class :character 1st Qu.:0.4490
## Median : 0.00 Median : 9.69 Mode :character Median :0.5380
## Mean : 11.36 Mean :11.14 Mean :0.5547
## 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.6240
## Max. :100.00 Max. :27.74 Max. :0.8710
##
## RM AGE DIS RAD
## Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000
## 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000
## Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000
## Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549
## 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000
##
## TAX PTRATIO B LSTAT
## Min. :187.0 Min. :12.60 Min. : 0.32 Min. : 1.73
## 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95
## Median :330.0 Median :19.05 Median :391.44 Median :11.36
## Mean :408.2 Mean :18.46 Mean :356.67 Mean :12.65
## 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95
## Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.97
##
## units cu5k c5_7_5 C7_5_10
## Min. : 5.0 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 115.0 1st Qu.: 0.000 1st Qu.: 1.000 1st Qu.: 2.000
## Median : 511.5 Median : 1.000 Median : 3.000 Median : 6.000
## Mean : 680.8 Mean : 2.921 Mean : 5.534 Mean : 9.984
## 3rd Qu.:1152.0 3rd Qu.: 4.000 3rd Qu.: 7.000 3rd Qu.: 13.000
## Max. :3031.0 Max. :35.000 Max. :70.000 Max. :121.000
##
## C10_15 C15_20 C20_25 C25_35
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 14.00 1st Qu.: 19.0 1st Qu.: 13.0 1st Qu.: 7.0
## Median : 33.00 Median : 85.0 Median :101.5 Median : 95.0
## Mean : 55.41 Mean :141.8 Mean :166.2 Mean : 170.8
## 3rd Qu.: 75.50 3rd Qu.:210.0 3rd Qu.:281.8 3rd Qu.: 292.0
## Max. :520.00 Max. :937.0 Max. :723.0 Max. :1189.0
##
## C35_50 co50k median BB
## Min. : 0.00 Min. : 0.00 Min. : 5600 Min. : 0.000
## 1st Qu.: 1.00 1st Qu.: 0.00 1st Qu.:16800 1st Qu.: 0.200
## Median : 17.00 Median : 3.00 Median :21000 Median : 0.500
## Mean : 82.74 Mean : 45.44 Mean :21749 Mean : 6.082
## 3rd Qu.: 97.00 3rd Qu.: 20.00 3rd Qu.:24700 3rd Qu.: 1.600
## Max. :769.00 Max. :980.00 Max. :50000 Max. :96.400
## NA's :17
## censored NOX_ID POP CMEDV_num
## Length:506 Min. : 1.00 Min. : 434 Min. : 5.00
## Class :character 1st Qu.:18.00 1st Qu.: 3697 1st Qu.:17.02
## Mode :character Median :44.00 Median : 5105 Median :21.20
## Mean :41.77 Mean : 5340 Mean :22.53
## 3rd Qu.:62.00 3rd Qu.: 6825 3rd Qu.:25.00
## Max. :96.00 Max. :15976 Max. :50.00
##
## CMEDV_cat NOX_cat RM_cat lag_CMEDV
## [5,17] :127 [0.385,0.449]:129 [3.56,5.89]:127 Min. : 5.70
## (17,21.2]:129 (0.449,0.538]:143 (5.89,6.21]:126 1st Qu.:18.18
## (21.2,25]:126 (0.538,0.624]:110 (6.21,6.62]:126 Median :22.40
## (25,50] :124 (0.624,0.871]:124 (6.62,8.78]:127 Mean :22.76
## 3rd Qu.:27.06
## Max. :50.00
##
## lag_NOX lag_RM cluster_cmedv
## Min. :0.3930 Min. :4.138 Not significant:306
## 1st Qu.:0.4647 1st Qu.:6.004 High-High : 89
## Median :0.5371 Median :6.263 Low-Low :106
## Mean :0.5551 Mean :6.297 Low-High : 5
## 3rd Qu.:0.6387 3rd Qu.:6.577 High-Low : 0
## Max. :0.8710 Max. :7.739 Undefined : 0
## Isolated : 0
boston_data <- boston_data %>%
select(-poltract, -TOWN, -censored, -TOWNNO, -TRACT)
colSums(is.na(boston_data))
## LON LAT CMEDV CRIM ZN
## 0 0 0 0 0
## INDUS CHAS NOX RM AGE
## 0 0 0 0 0
## DIS RAD TAX PTRATIO B
## 0 0 0 0 0
## LSTAT units cu5k c5_7_5 C7_5_10
## 0 0 0 0 0
## C10_15 C15_20 C20_25 C25_35 C35_50
## 0 0 0 0 0
## co50k median BB NOX_ID POP
## 0 17 0 0 0
## CMEDV_num CMEDV_cat NOX_cat RM_cat lag_CMEDV
## 0 0 0 0 0
## lag_NOX lag_RM cluster_cmedv
## 0 0 0
boston_data <- na.omit(boston_data)
nrow(boston_data)
## [1] 489
subset_indices <- which(complete.cases(boston_data))
boston_tr_clean <- boston.tr[subset_indices, ]
# Crear vecinos y lista de pesos para spdep
boston_nb <- poly2nb(boston_tr_clean)
listw <- nb2listw(boston_nb, style = "W", zero.policy = TRUE)
length(listw$neighbours)
## [1] 489
# Modelo tradicional
modelo_ols <- lm(CMEDV ~ ., data = boston_data)
summary(modelo_ols)
## Warning in summary.lm(modelo_ols): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = CMEDV ~ ., data = boston_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.035e-15 -7.800e-18 1.610e-18 1.217e-17 1.627e-16
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.418e-13 9.587e-15 2.523e+01 < 2e-16 ***
## LON 3.055e-15 8.904e-17 3.431e+01 < 2e-16 ***
## LAT -6.105e-16 1.569e-16 -3.891e+00 0.000115 ***
## CRIM -2.117e-17 8.835e-19 -2.396e+01 < 2e-16 ***
## ZN 4.197e-19 3.447e-19 1.218e+00 0.223988
## INDUS 1.633e-17 1.645e-18 9.925e+00 < 2e-16 ***
## CHAS1 1.158e-16 2.165e-17 5.350e+00 1.41e-07 ***
## NOX -1.035e-15 2.135e-16 -4.849e+00 1.72e-06 ***
## RM 1.846e-16 1.930e-17 9.561e+00 < 2e-16 ***
## AGE -5.362e-19 3.709e-19 -1.446e+00 0.148917
## DIS -6.700e-17 5.890e-18 -1.138e+01 < 2e-16 ***
## RAD -1.440e-17 1.889e-18 -7.622e+00 1.53e-13 ***
## TAX 1.206e-18 8.893e-20 1.357e+01 < 2e-16 ***
## PTRATIO -7.331e-18 3.422e-18 -2.143e+00 0.032683 *
## B -1.107e-18 1.313e-19 -8.434e+00 4.75e-16 ***
## LSTAT -7.471e-18 1.425e-18 -5.242e+00 2.47e-07 ***
## units -5.303e-19 1.009e-19 -5.257e+00 2.28e-07 ***
## cu5k 1.288e-18 1.733e-18 7.440e-01 0.457559
## c5_7_5 2.302e-18 1.281e-18 1.797e+00 0.072982 .
## C7_5_10 -2.465e-19 1.091e-18 -2.260e-01 0.821348
## C10_15 5.467e-20 3.250e-19 1.680e-01 0.866482
## C15_20 1.390e-18 1.616e-19 8.598e+00 < 2e-16 ***
## C20_25 -5.935e-19 1.789e-19 -3.318e+00 0.000982 ***
## C25_35 1.324e-18 1.138e-19 1.163e+01 < 2e-16 ***
## C35_50 2.166e-19 1.960e-19 1.105e+00 0.269674
## co50k NA NA NA NA
## median -4.316e-19 3.062e-20 -1.410e+01 < 2e-16 ***
## BB 2.510e-19 6.516e-19 3.850e-01 0.700301
## NOX_ID -4.615e-21 5.474e-19 -8.000e-03 0.993277
## POP 1.354e-21 3.437e-21 3.940e-01 0.693719
## CMEDV_num 1.000e+00 3.059e-17 3.269e+16 < 2e-16 ***
## CMEDV_cat(17,21.2] 3.970e-18 2.145e-17 1.850e-01 0.853275
## CMEDV_cat(21.2,25] 1.435e-19 2.794e-17 5.000e-03 0.995904
## CMEDV_cat(25,50] -1.298e-18 4.117e-17 -3.200e-02 0.974864
## NOX_cat(0.449,0.538] -9.161e-19 2.063e-17 -4.400e-02 0.964596
## NOX_cat(0.538,0.624] -5.790e-18 3.158e-17 -1.830e-01 0.854594
## NOX_cat(0.624,0.871] -2.251e-17 5.018e-17 -4.490e-01 0.653943
## RM_cat(5.89,6.21] 1.638e-17 1.607e-17 1.020e+00 0.308399
## RM_cat(6.21,6.62] 2.545e-17 2.099e-17 1.213e+00 0.225853
## RM_cat(6.62,8.78] 2.849e-17 3.182e-17 8.950e-01 0.371185
## lag_CMEDV 9.690e-19 2.152e-18 4.500e-01 0.652663
## lag_NOX 2.724e-16 1.966e-16 1.386e+00 0.166472
## lag_RM 2.102e-17 1.951e-17 1.077e+00 0.281920
## cluster_cmedvHigh-High -1.654e-17 2.278e-17 -7.260e-01 0.468168
## cluster_cmedvLow-Low -2.707e-17 2.218e-17 -1.221e+00 0.222817
## cluster_cmedvLow-High -1.022e-17 5.089e-17 -2.010e-01 0.840954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.979e-17 on 444 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 6.927e+34 on 44 and 444 DF, p-value: < 2.2e-16
AIC(modelo_ols)
## [1] -34600.37
# Modelo tradicional
modelo_ols2 <- lm(CMEDV ~ NOX + RM + LSTAT, data = boston_data)
summary(modelo_ols2)
##
## Call:
## lm(formula = CMEDV ~ NOX + RM + LSTAT, data = boston_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.510 -3.094 -0.649 2.186 27.217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.76809 2.97198 1.941 0.052859 .
## NOX -7.59270 2.29257 -3.312 0.000996 ***
## RM 4.31356 0.40920 10.542 < 2e-16 ***
## LSTAT -0.52537 0.04535 -11.586 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.666 on 485 degrees of freedom
## Multiple R-squared: 0.6521, Adjusted R-squared: 0.6499
## F-statistic: 303 on 3 and 485 DF, p-value: < 2.2e-16
AIC(modelo_ols2)
## [1] 2900.176
alias_check <- alias(lm(CMEDV ~ ., data = boston_data))
alias_check$Complete
## (Intercept) LON LAT CRIM ZN INDUS CHAS1 NOX RM AGE DIS RAD TAX PTRATIO B
## co50k 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## LSTAT units cu5k c5_7_5 C7_5_10 C10_15 C15_20 C20_25 C25_35 C35_50 median
## co50k 0 1 -1 -1 -1 -1 -1 -1 -1 -1 0
## BB NOX_ID POP CMEDV_num CMEDV_cat(17,21.2] CMEDV_cat(21.2,25]
## co50k 0 0 0 0 0 0
## CMEDV_cat(25,50] NOX_cat(0.449,0.538] NOX_cat(0.538,0.624]
## co50k 0 0 0
## NOX_cat(0.624,0.871] RM_cat(5.89,6.21] RM_cat(6.21,6.62]
## co50k 0 0 0
## RM_cat(6.62,8.78] lag_CMEDV lag_NOX lag_RM cluster_cmedvHigh-High
## co50k 0 0 0 0 0
## cluster_cmedvLow-Low cluster_cmedvLow-High
## co50k 0 0
# Elimina la variable colineal
boston_data_cl <- boston_data %>% select(-co50k)
length(listw$neighbours) == nrow(boston_data_cl)
## [1] TRUE
# Modelo SAR
modelo_sar <- lagsarlm(CMEDV ~ NOX + RM + LSTAT, data = boston_data_cl, listw = listw, zero.policy = TRUE)
summary(modelo_sar)
##
## Call:lagsarlm(formula = CMEDV ~ NOX + RM + LSTAT, data = boston_data_cl,
## listw = listw, zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.86338 -2.54148 -0.62026 1.73914 27.81993
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.01948 2.84202 -2.4699 0.01352
## NOX -2.65396 2.13216 -1.2447 0.21323
## RM 4.10679 0.36460 11.2638 < 2e-16
## LSTAT -0.36191 0.04233 -8.5498 < 2e-16
##
## Rho: 0.42387, LR test value: 101.73, p-value: < 2.22e-16
## Asymptotic standard error: 0.04033
## z-value: 10.51, p-value: < 2.22e-16
## Wald statistic: 110.46, p-value: < 2.22e-16
##
## Log likelihood: -1394.224 for lag model
## ML residual variance (sigma squared): 16.919, (sigma: 4.1132)
## Number of observations: 489
## Number of parameters estimated: 6
## AIC: 2800.4, (AIC for lm: 2900.2)
## LM test for residual autocorrelation
## test value: 24.462, p-value: 7.5794e-07
AIC(modelo_sar)
## [1] 2800.448
# Modelo SEM
modelo_sem <- errorsarlm(CMEDV ~ NOX + RM + LSTAT, data = boston_data_cl, listw = listw)
summary(modelo_sem)
##
## Call:errorsarlm(formula = CMEDV ~ NOX + RM + LSTAT, data = boston_data_cl,
## listw = listw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.48488 -2.26901 -0.44534 1.53407 23.86882
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.095860 2.983377 2.0433 0.04103
## NOX -15.835402 2.984003 -5.3068 1.116e-07
## RM 4.672360 0.363562 12.8516 < 2.2e-16
## LSTAT -0.373749 0.044078 -8.4792 < 2.2e-16
##
## Lambda: 0.65719, LR test value: 146.19, p-value: < 2.22e-16
## Asymptotic standard error: 0.043422
## z-value: 15.135, p-value: < 2.22e-16
## Wald statistic: 229.07, p-value: < 2.22e-16
##
## Log likelihood: -1371.992 for error model
## ML residual variance (sigma squared): 14.508, (sigma: 3.8089)
## Number of observations: 489
## Number of parameters estimated: 6
## AIC: 2756, (AIC for lm: 2900.2)
# Modelo Espacial Durbin
modelo_durbin <- lagsarlm(CMEDV ~ NOX + RM + LSTAT, data = boston_data_cl, listw = listw, type = "mixed")
summary(modelo_durbin)
##
## Call:lagsarlm(formula = CMEDV ~ NOX + RM + LSTAT, data = boston_data_cl,
## listw = listw, type = "mixed")
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.8059 -2.3094 -0.5183 1.5568 22.5557
##
## Type: mixed
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.055798 4.226959 1.6692 0.09507
## NOX -18.983790 3.611366 -5.2567 1.467e-07
## RM 4.736512 0.360675 13.1323 < 2.2e-16
## LSTAT -0.341935 0.045082 -7.5847 3.331e-14
## lag.NOX 22.651309 4.534202 4.9957 5.864e-07
## lag.RM -4.045986 0.670878 -6.0309 1.631e-09
## lag.LSTAT -0.053098 0.081117 -0.6546 0.51274
##
## Rho: 0.61635, LR test value: 133.2, p-value: < 2.22e-16
## Asymptotic standard error: 0.045869
## z-value: 13.437, p-value: < 2.22e-16
## Wald statistic: 180.56, p-value: < 2.22e-16
##
## Log likelihood: -1362.896 for mixed model
## ML residual variance (sigma squared): 14.179, (sigma: 3.7655)
## Number of observations: 489
## Number of parameters estimated: 9
## AIC: 2743.8, (AIC for lm: 2875)
## LM test for residual autocorrelation
## test value: 0.73372, p-value: 0.39168
Generar breve comparativa de estos modelos y elegir ¿cuál consideran qué es el mejor modelo? y ¿por qué?
library(stargazer)
stargazer(modelo_ols2, modelo_sar, modelo_sem, modelo_durbin,
type = "text", title = "Comparación de Modelos", align = TRUE)
##
## Comparación de Modelos
## =====================================================================================
## Dependent variable:
## -----------------------------------------------------------------
## CMEDV
## OLS spatial spatial spatial
## autoregressive error autoregressive
## (1) (2) (3) (4)
## -------------------------------------------------------------------------------------
## NOX -7.593*** -2.654 -15.835*** -18.984***
## (2.293) (2.132) (2.984) (3.611)
##
## RM 4.314*** 4.107*** 4.672*** 4.737***
## (0.409) (0.365) (0.364) (0.361)
##
## LSTAT -0.525*** -0.362*** -0.374*** -0.342***
## (0.045) (0.042) (0.044) (0.045)
##
## lag.NOX 22.651***
## (4.534)
##
## lag.RM -4.046***
## (0.671)
##
## lag.LSTAT -0.053
## (0.081)
##
## Constant 5.768* -7.019** 6.096** 7.056*
## (2.972) (2.842) (2.983) (4.227)
##
## -------------------------------------------------------------------------------------
## Observations 489 489 489 489
## R2 0.652
## Adjusted R2 0.650
## Log Likelihood -1,394.224 -1,371.992 -1,362.896
## sigma2 16.919 14.508 14.179
## Akaike Inf. Crit. 2,800.448 2,755.985 2,743.791
## Residual Std. Error 4.666 (df = 485)
## F Statistic 302.994*** (df = 3; 485)
## Wald Test (df = 1) 110.461*** 229.071*** 180.556***
## LR Test (df = 1) 101.728*** 146.191*** 133.198***
## =====================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
El mejor modelo es el Durbin, es el que tiene el AIC más bajo de 2743.791, tiene el menor error residual, lo que sugiere que predice con mayor precisión.
Similar a las exposiciones de las Actividades 01 y 02, esta se llevará a cabo al iniciar la clase del Lunes y para esto pueden utilizar únicamente el archivo HTML kniteado a partir de su código RMD.
Sin embargo, la entrega final deberá incluir tanto tu código completamente ejecutado en archivo HTML, así como una breve presentación en la que se mencionen claramente los principales patrones o insights detectados en su análisis, así como sus conclusiones. Los archivos se deberán subir en Canvas por equipo a más tardar el día Viernes 25 de Abril en la sección habilitada por el profesor para esto, junto con los archivos correspondientes a las actividades 01 y 02.