Ejercicio Regresión Lineal Espacial

El propósito de esta actividad es poner en práctica lo estudiado hasta ahora en los temas de Análisis Exploratorio de Datos Espaciales, así como Modelos de Regresión Lineal Espaciales.

Primeramente aquí se proporcionan las librerías principales que deberás considerar. De todas maneras si deseas agregar cualquier otra librería lo puedes hacer y recomendaría hacerlo de una vez dentro del siguiente chunk de código:

#install.packages("dlookr")
library(foreign)
library(dplyr)
library(spdep)
library(tigris)
library(tidyr)
library(rgeoda)
library(RColorBrewer)
library(viridis)
library(ggplot2)
library(tmap)
library(sf) 
library(sp)
library(spatialreg)
library(stargazer)
library(plm)
library(splm)
library(pspatreg)
library(regclass)
library(mctest)
library(lmtest)
library(spData)
library(mapview)
library(naniar)
library(dlookr)
library(caret)
library(e1071)
library(SparseM)
library(Metrics)
library(randomForest)
library(rpart.plot)
library(insight)
library(jtools)
library(xgboost)
library(DiagrammeR)
library(effects)
library(leaflet)
library(GGally)

A continuación se carga la base de datos a ser analizada: ‘boston’ de la librería ‘spData’

data(boston, package="spData")
head(data)
##                                                                             
## 1 function (..., list = character(), package = NULL, lib.loc = NULL,        
## 2     verbose = getOption("verbose"), envir = .GlobalEnv, overwrite = TRUE) 
## 3 {                                                                         
## 4     fileExt <- function(x) {                                              
## 5         db <- grepl("\\\\.[^.]+\\\\.(gz|bz2|xz)$", x)                     
## 6         ans <- sub(".*\\\\.", "", x)
glimpse(data())
## List of 4
##  $ title  : chr "Data sets"
##  $ header : NULL
##  $ results: chr [1:541, 1:4] "GGally" "GGally" "GGally" "GGally" ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:4] "Package" "LibPath" "Item" "Title"
##  $ footer : chr "Use 'data(package = .packages(all.available = TRUE))'\nto list the data sets in all *available* packages."
##  - attr(*, "class")= chr "packageIQR"

Seguidamente, se cargan las geocercas asociadas a la data transversal previamente cargada

boston.tr <-st_read(system.file("shapes/boston_tracts.gpkg", package="spData")[1])
## Reading layer `boston_tracts' from data source 
##   `C:\Users\mari0\AppData\Local\R\win-library\4.4\spData\shapes\boston_tracts.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 506 features and 36 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -71.52311 ymin: 42.00305 xmax: -70.63823 ymax: 42.67307
## Geodetic CRS:  NAD27

Ejemplo de primer visualización en Mapas utilizando los datos ya cargados y visualizando los valores utilizando una variedad de paletas de colores.

boston.trSP<-as(boston.tr, "Spatial")
boston_nb<-poly2nb(boston.trSP) 
mapview(boston.trSP, zcol="CMEDV", col.regions = colorRampPalette(c('lightgrey', 'darkblue')))
mapview(boston.trSP, zcol="CMEDV", col.regions = brewer.pal(7,'Blues'))
## Warning: Found less unique colors (7) than unique zcol values (228)! 
## Interpolating color vector to match number of zcol values.
mapview(boston.trSP, zcol="CMEDV", col.regions = viridisLite::magma(20))
## Warning: Found less unique colors (20) than unique zcol values (228)! 
## Interpolating color vector to match number of zcol values.

La meta esencial del análisis que llevarás a cabo será explicar la variable ‘CRIM’ o ‘CMEDV’ de acuerdo a la paridad del número de su equipo: - Equipos Impares 1, 3, 5 y 7 explicarán ‘CRIM’ - Equipos Pares 2, 4, 6 explicarán ‘CMEDV’ (en este caso deberán de omitir la variable ‘MEDV’ de la base de datos)

Realiza un breve análisis exploratorio de datos tradicional. Cuestiones que se deberán de incluir en esta sección incluyen: - Histogramas - Matriz de Correlaciones - Boxplots

boston_data <- boston.tr %>% st_drop_geometry()

numeric_vars <- boston_data %>%
  select(where(is.numeric)) %>%
  select(-MEDV)

long_data <- pivot_longer(numeric_vars, everything(), names_to = "Variable", values_to = "Valor")
ggplot(long_data, aes(x = Valor)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~ Variable, scales = "free", ncol = 4) +
  theme_minimal() +
  labs(title = "Histogramas de Variables Numéricas")
## Warning: Removed 17 rows containing non-finite outside the scale range
## (`stat_bin()`).

subset_vars <- numeric_vars %>%
  select(CMEDV, RM, NOX, DIS, PTRATIO, LSTAT)

ggpairs(subset_vars) + 
  theme_minimal() +
  ggtitle("Matriz de Correlaciones")

library(corrplot)
## corrplot 0.92 loaded
# Calcular matriz de correlación
cor_matrix <- cor(numeric_vars)

top_corrs <- as.data.frame(as.table(cor_matrix)) %>%
  filter(Var1 != Var2) %>%                     
  mutate(abs_corr = abs(Freq)) %>%            
  arrange(desc(abs_corr)) %>%                 
  distinct(abs_corr, .keep_all = TRUE) %>%    
  slice_head(n = 10) 

top_vars <- unique(c(top_corrs$Var1, top_corrs$Var2))

top10_data <- numeric_vars[, top_vars]

top10_corr_matrix <- cor(top10_data)

corrplot(top10_corr_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.srt = 45, title = "Top 10 Correlaciones", mar = c(0,0,2,0))

boston.tr$CMEDV <- as.factor(boston.tr$CMEDV)

# Boxplots para algunas variables contra CMEDV
ggplot(boston.tr, aes(x = CMEDV, y = RM, fill = CMEDV)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Boxplot de RM (habitaciones) según CMEDV", x = "Categoría de Valor Medio (CMEDV)", y = "Número de Habitaciones")

ggplot(boston.tr, aes(x = CMEDV, y = LSTAT, fill = CMEDV)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Boxplot de LSTAT (% población de bajos recursos) según CMEDV", x = "Categoría de Valor Medio (CMEDV)", y = "% LSTAT")

ggplot(boston.tr, aes(y = LSTAT)) +
  geom_boxplot(fill = "skyblue") +
  theme_minimal() +
  labs(title = "Boxplot de LSTAT",
       y = "RM")

ggplot(boston.tr, aes(y = RM)) +
  geom_boxplot(fill = "skyblue") +
  theme_minimal() +
  labs(title = "Boxplot de RM (promedio de habitaciones)",
       y = "RM")

mapview(boston.trSP, zcol="NOX", col.regions = colorRampPalette(c('lightgrey', 'darkgreen')))
mapview(boston.trSP, zcol="LSTAT", col.regions = colorRampPalette(c('lightgrey', 'darkorange')))

Realiza un análisis exploratorio de datos espaciales. Los puntos esenciales a generar en esta sección son los siguientes: - Definir claramente la matriz de conectividad espacial - Graficar dicha matriz, utilizar la estructura ‘Reina’ para determinar los vecinos - Realizar mapa coloreado de acuerdo al valor de la variable que se desea explicar y al menos 2 de las variables independientes que se utilizarán (mapa de 5 variables: high-high, low-low, high-low, low-high) - Realizar análisis de clusters: + Incluir mapa de un lag espacial de la variable que se desea explicar junto con el mapa original previamente realizado + Visualización Espacial de Clusters (HotSpots, ColdSpots, NonSignificant y Atípicos: HighLows, LowHighs) - Índice Global de Moran + Calcular el Estadístico del Índice Global de Moran para las variables previamente graficadas (al menos son 3: 1 dependiente, 2 independientes) + Generar sus respectivos gráficos de dispersión

swm  <- poly2nb(boston.tr, queen=T)

summary(swm)
## Neighbour list object:
## Number of regions: 506 
## Number of nonzero links: 2910 
## Percentage nonzero weights: 1.136559 
## Average number of links: 5.750988 
## Link number distribution:
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  15 
##   3   9  28  81 107 120  87  40  22   5   2   1   1 
## 3 least connected regions:
## 18 51 345 with 1 link
## 1 most connected region:
## 112 with 15 links
sswm <- nb2listw(swm, style="W", zero.policy = TRUE)

boston_a <- as(boston.tr, "Spatial")
boston_centroid <- coordinates(boston_a) 
plot(boston_a,border="blue",axes=FALSE,las=1, main="Boston's States Queen SWM")
plot(boston_a,col="grey",border=grey(0.9),axes=T,add=T) 
plot(sswm,coords=boston_centroid,pch=19,cex=0.1,col="red",add=T) 

# Clasificación cuantil para categorizar valores
boston.tr$CMEDV_num <- as.numeric(as.character(boston.tr$CMEDV))

boston.tr$CMEDV_cat <- cut(boston.tr$CMEDV_num, 
                           breaks = quantile(boston.tr$CMEDV_num, probs = seq(0, 1, 0.25), na.rm = TRUE), 
                           include.lowest = TRUE)
boston.tr$NOX_cat <- cut(boston.tr$NOX, breaks = quantile(boston.tr$NOX, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
boston.tr$RM_cat <- cut(boston.tr$RM, breaks = quantile(boston.tr$RM, probs = seq(0, 1, 0.25)), include.lowest = TRUE)

# Mapas individuales
tm1 <- tm_shape(boston.tr) +
  tm_fill("CMEDV_cat", palette = "YlOrRd", title = "CMEDV (Target)") +
  tm_borders()
## 
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_tm_fill()`: migrate the argument(s) related to the scale of the
## visual variable `fill` namely 'palette' (rename to 'values') to fill.scale =
## tm_scale(<HERE>).
## [v3->v4] `tm_fill()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
tm2 <- tm_shape(boston.tr) +
  tm_fill("NOX_cat", palette = "Purples", title = "NOX (Independent)") +
  tm_borders()
## [v3->v4] `tm_fill()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
tm3 <- tm_shape(boston.tr) +
  tm_fill("RM_cat", palette = "Blues", title = "RM (Independent)") +
  tm_borders()
## [v3->v4] `tm_fill()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
# Mostrar mapas en conjunto
tmap_arrange(tm1, tm2, tm3)
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "YlOrRd" is named
## "brewer.yl_or_rd"
## Multiple palettes called "yl_or_rd" found: "brewer.yl_or_rd", "matplotlib.yl_or_rd". The first one, "brewer.yl_or_rd", is returned.
## 
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Purples" is named
## "brewer.purples"
## Multiple palettes called "purples" found: "brewer.purples", "matplotlib.purples". The first one, "brewer.purples", is returned.
## 
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Blues" is named
## "brewer.blues"
## Multiple palettes called "blues" found: "brewer.blues", "matplotlib.blues". The first one, "brewer.blues", is returned.

# Crear matriz de vecinos
neighbors <- poly2nb(boston.tr)
sswm <- nb2listw(neighbors, style = "W", zero.policy = TRUE)

boston.tr$CMEDV <- as.numeric(as.character(boston.tr$CMEDV))

# Lag espacial para CMEDV
boston.tr$lag_CMEDV <- lag.listw(sswm, boston.tr$CMEDV, zero.policy = TRUE)
boston.tr$lag_NOX <- lag.listw(sswm, boston.tr$NOX, zero.policy = TRUE)
boston.tr$lag_RM <- lag.listw(sswm, boston.tr$RM, zero.policy = TRUE)

# Mapa original + mapa de lag espacial
tm_original <- tm_shape(boston.tr) +
  tm_fill("CMEDV", palette = "YlOrRd", title = "CMEDV Original") +
  tm_borders()
## 
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_tm_fill()`: migrate the argument(s) related to the scale of the
## visual variable `fill` namely 'palette' (rename to 'values') to fill.scale =
## tm_scale(<HERE>).
## [v3->v4] `tm_fill()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
tm_lag <- tm_shape(boston.tr) +
  tm_fill("lag_CMEDV", palette = "Oranges", title = "Lag Espacial CMEDV") +
  tm_borders()
## [v3->v4] `tm_fill()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
tmap_arrange(tm_original, tm_lag)
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "YlOrRd" is named
## "brewer.yl_or_rd"
## Multiple palettes called "yl_or_rd" found: "brewer.yl_or_rd", "matplotlib.yl_or_rd". The first one, "brewer.yl_or_rd", is returned.
## 
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Oranges" is named
## "brewer.oranges"
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.

# LISA para CMEDV
neighbors_geo <- queen_weights(boston.tr)
lisa_result <- local_moran(neighbors_geo, data.frame(CMEDV = boston.tr$CMEDV))
#boston.tr$lisa_cluster <- as.factor(lisa_clusters(lisa_result))
boston.tr$cluster_cmedv <- as.factor(lisa_result$GetClusterIndicators())
levels(boston.tr$cluster_cmedv) <- lisa_result$GetLabels()

# Mapa de clusters
ggplot(data = boston.tr) +
  geom_sf(aes(fill = cluster_cmedv)) +
  scale_fill_manual(values = c("lightgrey", "#9F9FED", "#CAD593", "#2D936C", "skyblue")) + 
  theme_minimal() +
  labs(
    title = "Cluster LISA para CMEDV",
    subtitle = "Análisis Espacial Local en Boston",
    fill = "Tipo de Cluster"
  )

# Cálculo del índice para CMEDV, NOX y RM
boston.tr$CMEDV <- as.numeric(as.character(boston.tr$CMEDV))

moran_cmedv <- moran.test(boston.tr$CMEDV, sswm)
moran_nox <- moran.test(boston.tr$NOX, sswm)
moran_rm <- moran.test(boston.tr$RM, sswm)

# Mostrar resultados
print(moran_cmedv)
## 
##  Moran I test under randomisation
## 
## data:  boston.tr$CMEDV  
## weights: sswm    
## 
## Moran I statistic standard deviate = 23.558, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.6322686784     -0.0019801980      0.0007248376
print(moran_nox)
## 
##  Moran I test under randomisation
## 
## data:  boston.tr$NOX  
## weights: sswm    
## 
## Moran I statistic standard deviate = 33.692, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.9065225426     -0.0019801980      0.0007271085
print(moran_rm)
## 
##  Moran I test under randomisation
## 
## data:  boston.tr$RM  
## weights: sswm    
## 
## Moran I statistic standard deviate = 17.094, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.4580618698     -0.0019801980      0.0007242995
# Gráficos de dispersión para Moran (espacial vs variable original)
m_cmedv <- lm(boston.tr$lag_CMEDV ~ boston.tr$CMEDV)

plot(boston.tr$lag_CMEDV ~ boston.tr$CMEDV,
     pch = 21, asp = 1, las = 1, col = "grey40", bg = "grey80",
     main = "CMEDV - Índice de Moran",
     xlab = "CMEDV", ylab = "Lag espacial de CMEDV")

abline(m_cmedv, col = "blue")
abline(v = mean(boston.tr$CMEDV, na.rm = TRUE), lty = 3, col = "grey60")
abline(h = mean(boston.tr$lag_CMEDV, na.rm = TRUE), lty = 3, col = "grey60")

# Repetir para NOX
m_nox <- lm(boston.tr$lag_NOX ~ boston.tr$NOX)

plot(boston.tr$lag_NOX ~ boston.tr$NOX,
     pch = 21, asp = 1, las = 1, col = "grey40", bg = "grey80",
     main = "NOX - Índice de Moran",
     xlab = "NOX", ylab = "Lag espacial de NOX")

abline(m_nox, col = "blue")
abline(v = mean(boston.tr$NOX, na.rm = TRUE), lty = 3, col = "grey60")
abline(h = mean(boston.tr$lag_NOX, na.rm = TRUE), lty = 3, col = "grey60")

# RM
m_rm <- lm(boston.tr$lag_RM ~ boston.tr$RM)

plot(boston.tr$lag_RM ~ boston.tr$RM,
     pch = 21, asp = 1, las = 1, col = "grey40", bg = "grey80",
     main = "RM - Índice de Moran",
     xlab = "RM", ylab = "Lag espacial de RM")

abline(m_rm, col = "blue")
abline(v = mean(boston.tr$RM, na.rm = TRUE), lty = 3, col = "grey60")
abline(h = mean(boston.tr$lag_RM, na.rm = TRUE), lty = 3, col = "grey60")

Por último deberás de ajustar los Modelos de Regresión estudiados en este Módulo: - Modelo de Regresión Lineal Tradicional - Modelo de Regresión Espacial AutoRegresivo (SAR) - Modelo de Regresión Espacial de Errores (SEM) - Modelo de Regresión Espacial Durbin

# Prepara los datos sin la variable MEDV
boston_data <- boston.tr %>% st_drop_geometry() %>% select(-MEDV)
summary(boston_data)
##    poltract             TOWN               TOWNNO          TRACT     
##  Length:506         Length:506         Min.   : 0.00   Min.   :   1  
##  Class :character   Class :character   1st Qu.:26.25   1st Qu.:1303  
##  Mode  :character   Mode  :character   Median :42.00   Median :3394  
##                                        Mean   :47.53   Mean   :2700  
##                                        3rd Qu.:78.00   3rd Qu.:3740  
##                                        Max.   :91.00   Max.   :5082  
##                                                                      
##       LON              LAT            CMEDV            CRIM         
##  Min.   :-71.29   Min.   :42.03   Min.   : 5.00   Min.   : 0.00632  
##  1st Qu.:-71.09   1st Qu.:42.18   1st Qu.:17.02   1st Qu.: 0.08205  
##  Median :-71.05   Median :42.22   Median :21.20   Median : 0.25651  
##  Mean   :-71.06   Mean   :42.22   Mean   :22.53   Mean   : 3.61352  
##  3rd Qu.:-71.02   3rd Qu.:42.25   3rd Qu.:25.00   3rd Qu.: 3.67708  
##  Max.   :-70.81   Max.   :42.38   Max.   :50.00   Max.   :88.97620  
##                                                                     
##        ZN             INDUS           CHAS                NOX        
##  Min.   :  0.00   Min.   : 0.46   Length:506         Min.   :0.3850  
##  1st Qu.:  0.00   1st Qu.: 5.19   Class :character   1st Qu.:0.4490  
##  Median :  0.00   Median : 9.69   Mode  :character   Median :0.5380  
##  Mean   : 11.36   Mean   :11.14                      Mean   :0.5547  
##  3rd Qu.: 12.50   3rd Qu.:18.10                      3rd Qu.:0.6240  
##  Max.   :100.00   Max.   :27.74                      Max.   :0.8710  
##                                                                      
##        RM             AGE              DIS              RAD        
##  Min.   :3.561   Min.   :  2.90   Min.   : 1.130   Min.   : 1.000  
##  1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100   1st Qu.: 4.000  
##  Median :6.208   Median : 77.50   Median : 3.207   Median : 5.000  
##  Mean   :6.285   Mean   : 68.57   Mean   : 3.795   Mean   : 9.549  
##  3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188   3rd Qu.:24.000  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.000  
##                                                                    
##       TAX           PTRATIO            B              LSTAT      
##  Min.   :187.0   Min.   :12.60   Min.   :  0.32   Min.   : 1.73  
##  1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38   1st Qu.: 6.95  
##  Median :330.0   Median :19.05   Median :391.44   Median :11.36  
##  Mean   :408.2   Mean   :18.46   Mean   :356.67   Mean   :12.65  
##  3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23   3rd Qu.:16.95  
##  Max.   :711.0   Max.   :22.00   Max.   :396.90   Max.   :37.97  
##                                                                  
##      units             cu5k            c5_7_5          C7_5_10       
##  Min.   :   5.0   Min.   : 0.000   Min.   : 0.000   Min.   :  0.000  
##  1st Qu.: 115.0   1st Qu.: 0.000   1st Qu.: 1.000   1st Qu.:  2.000  
##  Median : 511.5   Median : 1.000   Median : 3.000   Median :  6.000  
##  Mean   : 680.8   Mean   : 2.921   Mean   : 5.534   Mean   :  9.984  
##  3rd Qu.:1152.0   3rd Qu.: 4.000   3rd Qu.: 7.000   3rd Qu.: 13.000  
##  Max.   :3031.0   Max.   :35.000   Max.   :70.000   Max.   :121.000  
##                                                                      
##      C10_15           C15_20          C20_25          C25_35      
##  Min.   :  0.00   Min.   :  0.0   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 14.00   1st Qu.: 19.0   1st Qu.: 13.0   1st Qu.:   7.0  
##  Median : 33.00   Median : 85.0   Median :101.5   Median :  95.0  
##  Mean   : 55.41   Mean   :141.8   Mean   :166.2   Mean   : 170.8  
##  3rd Qu.: 75.50   3rd Qu.:210.0   3rd Qu.:281.8   3rd Qu.: 292.0  
##  Max.   :520.00   Max.   :937.0   Max.   :723.0   Max.   :1189.0  
##                                                                   
##      C35_50           co50k            median            BB        
##  Min.   :  0.00   Min.   :  0.00   Min.   : 5600   Min.   : 0.000  
##  1st Qu.:  1.00   1st Qu.:  0.00   1st Qu.:16800   1st Qu.: 0.200  
##  Median : 17.00   Median :  3.00   Median :21000   Median : 0.500  
##  Mean   : 82.74   Mean   : 45.44   Mean   :21749   Mean   : 6.082  
##  3rd Qu.: 97.00   3rd Qu.: 20.00   3rd Qu.:24700   3rd Qu.: 1.600  
##  Max.   :769.00   Max.   :980.00   Max.   :50000   Max.   :96.400  
##                                    NA's   :17                      
##    censored             NOX_ID           POP          CMEDV_num    
##  Length:506         Min.   : 1.00   Min.   :  434   Min.   : 5.00  
##  Class :character   1st Qu.:18.00   1st Qu.: 3697   1st Qu.:17.02  
##  Mode  :character   Median :44.00   Median : 5105   Median :21.20  
##                     Mean   :41.77   Mean   : 5340   Mean   :22.53  
##                     3rd Qu.:62.00   3rd Qu.: 6825   3rd Qu.:25.00  
##                     Max.   :96.00   Max.   :15976   Max.   :50.00  
##                                                                    
##      CMEDV_cat            NOX_cat            RM_cat      lag_CMEDV    
##  [5,17]   :127   [0.385,0.449]:129   [3.56,5.89]:127   Min.   : 5.70  
##  (17,21.2]:129   (0.449,0.538]:143   (5.89,6.21]:126   1st Qu.:18.18  
##  (21.2,25]:126   (0.538,0.624]:110   (6.21,6.62]:126   Median :22.40  
##  (25,50]  :124   (0.624,0.871]:124   (6.62,8.78]:127   Mean   :22.76  
##                                                        3rd Qu.:27.06  
##                                                        Max.   :50.00  
##                                                                       
##     lag_NOX           lag_RM              cluster_cmedv
##  Min.   :0.3930   Min.   :4.138   Not significant:306  
##  1st Qu.:0.4647   1st Qu.:6.004   High-High      : 89  
##  Median :0.5371   Median :6.263   Low-Low        :106  
##  Mean   :0.5551   Mean   :6.297   Low-High       :  5  
##  3rd Qu.:0.6387   3rd Qu.:6.577   High-Low       :  0  
##  Max.   :0.8710   Max.   :7.739   Undefined      :  0  
##                                   Isolated       :  0
boston_data <- boston_data %>%
  select(-poltract, -TOWN, -censored, -TOWNNO, -TRACT)

colSums(is.na(boston_data))
##           LON           LAT         CMEDV          CRIM            ZN 
##             0             0             0             0             0 
##         INDUS          CHAS           NOX            RM           AGE 
##             0             0             0             0             0 
##           DIS           RAD           TAX       PTRATIO             B 
##             0             0             0             0             0 
##         LSTAT         units          cu5k        c5_7_5       C7_5_10 
##             0             0             0             0             0 
##        C10_15        C15_20        C20_25        C25_35        C35_50 
##             0             0             0             0             0 
##         co50k        median            BB        NOX_ID           POP 
##             0            17             0             0             0 
##     CMEDV_num     CMEDV_cat       NOX_cat        RM_cat     lag_CMEDV 
##             0             0             0             0             0 
##       lag_NOX        lag_RM cluster_cmedv 
##             0             0             0
boston_data <- na.omit(boston_data)
nrow(boston_data)
## [1] 489
subset_indices <- which(complete.cases(boston_data))
boston_tr_clean <- boston.tr[subset_indices, ]

# Crear vecinos y lista de pesos para spdep
boston_nb <- poly2nb(boston_tr_clean)
listw <- nb2listw(boston_nb, style = "W", zero.policy = TRUE)
length(listw$neighbours)
## [1] 489
# Modelo tradicional
modelo_ols <- lm(CMEDV ~ ., data = boston_data)
summary(modelo_ols)
## Warning in summary.lm(modelo_ols): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = CMEDV ~ ., data = boston_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.035e-15 -7.800e-18  1.610e-18  1.217e-17  1.627e-16 
## 
## Coefficients: (1 not defined because of singularities)
##                          Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)             2.418e-13  9.587e-15  2.523e+01  < 2e-16 ***
## LON                     3.055e-15  8.904e-17  3.431e+01  < 2e-16 ***
## LAT                    -6.105e-16  1.569e-16 -3.891e+00 0.000115 ***
## CRIM                   -2.117e-17  8.835e-19 -2.396e+01  < 2e-16 ***
## ZN                      4.197e-19  3.447e-19  1.218e+00 0.223988    
## INDUS                   1.633e-17  1.645e-18  9.925e+00  < 2e-16 ***
## CHAS1                   1.158e-16  2.165e-17  5.350e+00 1.41e-07 ***
## NOX                    -1.035e-15  2.135e-16 -4.849e+00 1.72e-06 ***
## RM                      1.846e-16  1.930e-17  9.561e+00  < 2e-16 ***
## AGE                    -5.362e-19  3.709e-19 -1.446e+00 0.148917    
## DIS                    -6.700e-17  5.890e-18 -1.138e+01  < 2e-16 ***
## RAD                    -1.440e-17  1.889e-18 -7.622e+00 1.53e-13 ***
## TAX                     1.206e-18  8.893e-20  1.357e+01  < 2e-16 ***
## PTRATIO                -7.331e-18  3.422e-18 -2.143e+00 0.032683 *  
## B                      -1.107e-18  1.313e-19 -8.434e+00 4.75e-16 ***
## LSTAT                  -7.471e-18  1.425e-18 -5.242e+00 2.47e-07 ***
## units                  -5.303e-19  1.009e-19 -5.257e+00 2.28e-07 ***
## cu5k                    1.288e-18  1.733e-18  7.440e-01 0.457559    
## c5_7_5                  2.302e-18  1.281e-18  1.797e+00 0.072982 .  
## C7_5_10                -2.465e-19  1.091e-18 -2.260e-01 0.821348    
## C10_15                  5.467e-20  3.250e-19  1.680e-01 0.866482    
## C15_20                  1.390e-18  1.616e-19  8.598e+00  < 2e-16 ***
## C20_25                 -5.935e-19  1.789e-19 -3.318e+00 0.000982 ***
## C25_35                  1.324e-18  1.138e-19  1.163e+01  < 2e-16 ***
## C35_50                  2.166e-19  1.960e-19  1.105e+00 0.269674    
## co50k                          NA         NA         NA       NA    
## median                 -4.316e-19  3.062e-20 -1.410e+01  < 2e-16 ***
## BB                      2.510e-19  6.516e-19  3.850e-01 0.700301    
## NOX_ID                 -4.615e-21  5.474e-19 -8.000e-03 0.993277    
## POP                     1.354e-21  3.437e-21  3.940e-01 0.693719    
## CMEDV_num               1.000e+00  3.059e-17  3.269e+16  < 2e-16 ***
## CMEDV_cat(17,21.2]      3.970e-18  2.145e-17  1.850e-01 0.853275    
## CMEDV_cat(21.2,25]      1.435e-19  2.794e-17  5.000e-03 0.995904    
## CMEDV_cat(25,50]       -1.298e-18  4.117e-17 -3.200e-02 0.974864    
## NOX_cat(0.449,0.538]   -9.161e-19  2.063e-17 -4.400e-02 0.964596    
## NOX_cat(0.538,0.624]   -5.790e-18  3.158e-17 -1.830e-01 0.854594    
## NOX_cat(0.624,0.871]   -2.251e-17  5.018e-17 -4.490e-01 0.653943    
## RM_cat(5.89,6.21]       1.638e-17  1.607e-17  1.020e+00 0.308399    
## RM_cat(6.21,6.62]       2.545e-17  2.099e-17  1.213e+00 0.225853    
## RM_cat(6.62,8.78]       2.849e-17  3.182e-17  8.950e-01 0.371185    
## lag_CMEDV               9.690e-19  2.152e-18  4.500e-01 0.652663    
## lag_NOX                 2.724e-16  1.966e-16  1.386e+00 0.166472    
## lag_RM                  2.102e-17  1.951e-17  1.077e+00 0.281920    
## cluster_cmedvHigh-High -1.654e-17  2.278e-17 -7.260e-01 0.468168    
## cluster_cmedvLow-Low   -2.707e-17  2.218e-17 -1.221e+00 0.222817    
## cluster_cmedvLow-High  -1.022e-17  5.089e-17 -2.010e-01 0.840954    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.979e-17 on 444 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 6.927e+34 on 44 and 444 DF,  p-value: < 2.2e-16
AIC(modelo_ols)
## [1] -34600.37
# Modelo tradicional
modelo_ols2 <- lm(CMEDV ~ NOX + RM + LSTAT, data = boston_data)
summary(modelo_ols2)
## 
## Call:
## lm(formula = CMEDV ~ NOX + RM + LSTAT, data = boston_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.510  -3.094  -0.649   2.186  27.217 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.76809    2.97198   1.941 0.052859 .  
## NOX         -7.59270    2.29257  -3.312 0.000996 ***
## RM           4.31356    0.40920  10.542  < 2e-16 ***
## LSTAT       -0.52537    0.04535 -11.586  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.666 on 485 degrees of freedom
## Multiple R-squared:  0.6521, Adjusted R-squared:  0.6499 
## F-statistic:   303 on 3 and 485 DF,  p-value: < 2.2e-16
AIC(modelo_ols2)
## [1] 2900.176
alias_check <- alias(lm(CMEDV ~ ., data = boston_data))
alias_check$Complete
##       (Intercept) LON LAT CRIM ZN INDUS CHAS1 NOX RM AGE DIS RAD TAX PTRATIO B 
## co50k  0           0   0   0    0  0     0     0   0  0   0   0   0   0       0
##       LSTAT units cu5k c5_7_5 C7_5_10 C10_15 C15_20 C20_25 C25_35 C35_50 median
## co50k  0     1    -1   -1     -1      -1     -1     -1     -1     -1      0    
##       BB NOX_ID POP CMEDV_num CMEDV_cat(17,21.2] CMEDV_cat(21.2,25]
## co50k  0  0      0   0         0                  0                
##       CMEDV_cat(25,50] NOX_cat(0.449,0.538] NOX_cat(0.538,0.624]
## co50k  0                0                    0                  
##       NOX_cat(0.624,0.871] RM_cat(5.89,6.21] RM_cat(6.21,6.62]
## co50k  0                    0                 0               
##       RM_cat(6.62,8.78] lag_CMEDV lag_NOX lag_RM cluster_cmedvHigh-High
## co50k  0                 0         0       0      0                    
##       cluster_cmedvLow-Low cluster_cmedvLow-High
## co50k  0                    0
# Elimina la variable colineal
boston_data_cl <- boston_data %>% select(-co50k)
length(listw$neighbours) == nrow(boston_data_cl)
## [1] TRUE
# Modelo SAR
modelo_sar <- lagsarlm(CMEDV ~ NOX + RM + LSTAT, data = boston_data_cl, listw = listw, zero.policy = TRUE)
summary(modelo_sar)
## 
## Call:lagsarlm(formula = CMEDV ~ NOX + RM + LSTAT, data = boston_data_cl, 
##     listw = listw, zero.policy = TRUE)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -14.86338  -2.54148  -0.62026   1.73914  27.81993 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.01948    2.84202 -2.4699  0.01352
## NOX         -2.65396    2.13216 -1.2447  0.21323
## RM           4.10679    0.36460 11.2638  < 2e-16
## LSTAT       -0.36191    0.04233 -8.5498  < 2e-16
## 
## Rho: 0.42387, LR test value: 101.73, p-value: < 2.22e-16
## Asymptotic standard error: 0.04033
##     z-value: 10.51, p-value: < 2.22e-16
## Wald statistic: 110.46, p-value: < 2.22e-16
## 
## Log likelihood: -1394.224 for lag model
## ML residual variance (sigma squared): 16.919, (sigma: 4.1132)
## Number of observations: 489 
## Number of parameters estimated: 6 
## AIC: 2800.4, (AIC for lm: 2900.2)
## LM test for residual autocorrelation
## test value: 24.462, p-value: 7.5794e-07
AIC(modelo_sar)
## [1] 2800.448
# Modelo SEM
modelo_sem <- errorsarlm(CMEDV ~ NOX + RM + LSTAT, data = boston_data_cl, listw = listw)
summary(modelo_sem)
## 
## Call:errorsarlm(formula = CMEDV ~ NOX + RM + LSTAT, data = boston_data_cl, 
##     listw = listw)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -17.48488  -2.26901  -0.44534   1.53407  23.86882 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##               Estimate Std. Error z value  Pr(>|z|)
## (Intercept)   6.095860   2.983377  2.0433   0.04103
## NOX         -15.835402   2.984003 -5.3068 1.116e-07
## RM            4.672360   0.363562 12.8516 < 2.2e-16
## LSTAT        -0.373749   0.044078 -8.4792 < 2.2e-16
## 
## Lambda: 0.65719, LR test value: 146.19, p-value: < 2.22e-16
## Asymptotic standard error: 0.043422
##     z-value: 15.135, p-value: < 2.22e-16
## Wald statistic: 229.07, p-value: < 2.22e-16
## 
## Log likelihood: -1371.992 for error model
## ML residual variance (sigma squared): 14.508, (sigma: 3.8089)
## Number of observations: 489 
## Number of parameters estimated: 6 
## AIC: 2756, (AIC for lm: 2900.2)
# Modelo Espacial Durbin
modelo_durbin <- lagsarlm(CMEDV ~ NOX + RM + LSTAT, data = boston_data_cl, listw = listw, type = "mixed")
summary(modelo_durbin)
## 
## Call:lagsarlm(formula = CMEDV ~ NOX + RM + LSTAT, data = boston_data_cl, 
##     listw = listw, type = "mixed")
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.8059  -2.3094  -0.5183   1.5568  22.5557 
## 
## Type: mixed 
## Coefficients: (asymptotic standard errors) 
##               Estimate Std. Error z value  Pr(>|z|)
## (Intercept)   7.055798   4.226959  1.6692   0.09507
## NOX         -18.983790   3.611366 -5.2567 1.467e-07
## RM            4.736512   0.360675 13.1323 < 2.2e-16
## LSTAT        -0.341935   0.045082 -7.5847 3.331e-14
## lag.NOX      22.651309   4.534202  4.9957 5.864e-07
## lag.RM       -4.045986   0.670878 -6.0309 1.631e-09
## lag.LSTAT    -0.053098   0.081117 -0.6546   0.51274
## 
## Rho: 0.61635, LR test value: 133.2, p-value: < 2.22e-16
## Asymptotic standard error: 0.045869
##     z-value: 13.437, p-value: < 2.22e-16
## Wald statistic: 180.56, p-value: < 2.22e-16
## 
## Log likelihood: -1362.896 for mixed model
## ML residual variance (sigma squared): 14.179, (sigma: 3.7655)
## Number of observations: 489 
## Number of parameters estimated: 9 
## AIC: 2743.8, (AIC for lm: 2875)
## LM test for residual autocorrelation
## test value: 0.73372, p-value: 0.39168

Generar breve comparativa de estos modelos y elegir ¿cuál consideran qué es el mejor modelo? y ¿por qué?

library(stargazer)
stargazer(modelo_ols2, modelo_sar, modelo_sem, modelo_durbin,
          type = "text", title = "Comparación de Modelos", align = TRUE)
## 
## Comparación de Modelos
## =====================================================================================
##                                            Dependent variable:                       
##                     -----------------------------------------------------------------
##                                                   CMEDV                              
##                               OLS               spatial      spatial      spatial    
##                                              autoregressive   error    autoregressive
##                               (1)                 (2)          (3)          (4)      
## -------------------------------------------------------------------------------------
## NOX                        -7.593***             -2.654     -15.835***   -18.984***  
##                             (2.293)             (2.132)      (2.984)      (3.611)    
##                                                                                      
## RM                          4.314***            4.107***     4.672***     4.737***   
##                             (0.409)             (0.365)      (0.364)      (0.361)    
##                                                                                      
## LSTAT                      -0.525***           -0.362***    -0.374***    -0.342***   
##                             (0.045)             (0.042)      (0.044)      (0.045)    
##                                                                                      
## lag.NOX                                                                  22.651***   
##                                                                           (4.534)    
##                                                                                      
## lag.RM                                                                   -4.046***   
##                                                                           (0.671)    
##                                                                                      
## lag.LSTAT                                                                  -0.053    
##                                                                           (0.081)    
##                                                                                      
## Constant                     5.768*             -7.019**     6.096**       7.056*    
##                             (2.972)             (2.842)      (2.983)      (4.227)    
##                                                                                      
## -------------------------------------------------------------------------------------
## Observations                  489                 489          489          489      
## R2                           0.652                                                   
## Adjusted R2                  0.650                                                   
## Log Likelihood                                 -1,394.224   -1,371.992   -1,362.896  
## sigma2                                           16.919       14.508       14.179    
## Akaike Inf. Crit.                              2,800.448    2,755.985    2,743.791   
## Residual Std. Error     4.666 (df = 485)                                             
## F Statistic         302.994*** (df = 3; 485)                                         
## Wald Test (df = 1)                             110.461***   229.071***   180.556***  
## LR Test (df = 1)                               101.728***   146.191***   133.198***  
## =====================================================================================
## Note:                                                     *p<0.1; **p<0.05; ***p<0.01

El mejor modelo es el Durbin, es el que tiene el AIC más bajo de 2743.791, tiene el menor error residual, lo que sugiere que predice con mayor precisión.


Similar a las exposiciones de las Actividades 01 y 02, esta se llevará a cabo al iniciar la clase del Lunes y para esto pueden utilizar únicamente el archivo HTML kniteado a partir de su código RMD.

Sin embargo, la entrega final deberá incluir tanto tu código completamente ejecutado en archivo HTML, así como una breve presentación en la que se mencionen claramente los principales patrones o insights detectados en su análisis, así como sus conclusiones. Los archivos se deberán subir en Canvas por equipo a más tardar el día Viernes 25 de Abril en la sección habilitada por el profesor para esto, junto con los archivos correspondientes a las actividades 01 y 02.