SUPERVISED MACHINE LEARNING

What is Supervised Machine Learning?

1) It is the use of labeled datasets to train algorithms to either classify data or predict outcomes accurately (IBM).

2) SML uses a training dataset to teach models to yield a desired output. The training dataset allows the model to learn over time.

3) SML problems are separated into two types: regression (predicting a numeric outcome) and classification (predicting a category).
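
As a minimal sketch of the two problem types, the snippet below fits one regression and one classification model. It uses the built-in mtcars data purely for illustration (an assumption of this example, not the Columbus data), and the object names are hypothetical.

```r
# Illustration on the built-in mtcars data (not the Columbus data)

# Regression: predict a numeric outcome (miles per gallon) from weight
reg_fit  <- lm(mpg ~ wt, data = mtcars)
reg_pred <- predict(reg_fit, newdata = data.frame(wt = 3))

# Classification: predict a category (am: 0 = automatic, 1 = manual) from weight
clf_fit   <- glm(am ~ wt, data = mtcars, family = binomial)
clf_prob  <- predict(clf_fit, newdata = data.frame(wt = 3), type = "response")
clf_label <- ifelse(clf_prob > 0.5, "manual", "automatic")
```

In both cases the labeled outcome (mpg or am) is what makes the task supervised.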

References: What is Supervised Machine Learning?

Dataset

1) The Columbus Ohio Spatial Analysis Dataset is a data frame of 49 rows and 22 columns. Each row corresponds to one of the 49 neighborhoods in Columbus, Ohio.

2) It is a real estate dataset which focuses on predicting the housing value.

3) The spData and spdep packages loaded below include the required shapefile (electronic map) to visualize data patterns across Columbus, OH.

Let's load the required libraries

data manipulation & data visualization

library(foreign)        # Read Data Stored by 'Minitab', 'S', 'SAS', 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase'
library(ggplot2)        # It is a system for creating graphics
## Warning: package 'ggplot2' was built under R version 4.3.1
library(dplyr)          # A fast, consistent tool for working with data frame like objects
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(mapview)        # Quickly and conveniently create interactive visualizations of spatial data with or without background maps
## Warning: package 'mapview' was built under R version 4.3.1
library(naniar)         # Provides data structures and functions that facilitate the plotting of missing values and examination of imputations.
#library(maptools)       # A collection of functions to create spatial weights matrix objects from polygon 'contiguities', for summarizing these objects, and for permitting their use in spatial data analysis
library(tmap)           # For drawing thematic maps
## Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
## remotes::install_github('r-tmap/tmap')
library(RColorBrewer)   # It offers several color palettes 
library(dlookr)         # A collection of tools that support data diagnosis, exploration, and transformation
## 
## Attaching package: 'dlookr'
## The following object is masked from 'package:base':
## 
##     transform
# predictive modeling
library(regclass)       # Contains basic tools for visualizing, interpreting, and building regression models
## Loading required package: bestglm
## Loading required package: leaps
## Loading required package: VGAM
## Warning: package 'VGAM' was built under R version 4.3.1
## Loading required package: stats4
## Loading required package: splines
## Loading required package: rpart
## Loading required package: randomForest
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Important regclass change from 1.3:
## All functions that had a . in the name now have an _
## all.correlations -> all_correlations, cor.demo -> cor_demo, etc.
library(mctest)         # Multicollinearity diagnostics
library(lmtest)         # Testing linear regression models
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'lmtest'
## The following object is masked from 'package:VGAM':
## 
##     lrtest
library(spdep)          # A collection of functions to create spatial weights matrix objects from polygon 'contiguities', for summarizing these objects, and for permitting their use in spatial data analysis
## Warning: package 'spdep' was built under R version 4.3.1
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
## Loading required package: sf
## Warning: package 'sf' was built under R version 4.3.1
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(sf)             # A standardized way to encode spatial vector data
library(spData)         # Diverse spatial datasets for demonstrating, benchmarking and teaching spatial data analysis
library(spatialreg)     # A collection of all the estimation functions for spatial cross-sectional models
## Warning: package 'spatialreg' was built under R version 4.3.1
## Loading required package: Matrix
## 
## Attaching package: 'spatialreg'
## The following objects are masked from 'package:spdep':
## 
##     get.ClusterOption, get.coresOption, get.mcOption,
##     get.VerboseOption, get.ZeroPolicyOption, set.ClusterOption,
##     set.coresOption, set.mcOption, set.VerboseOption,
##     set.ZeroPolicyOption
library(caret)          # The caret package (short for Classification And REgression Training) contains functions to streamline the model training process for complex regression and classification problems.
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.3.1
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:regclass':
## 
##     qq
## 
## Attaching package: 'caret'
## The following object is masked from 'package:VGAM':
## 
##     predictors
library(e1071)          # Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier, generalized k-nearest neighbor.
## 
## Attaching package: 'e1071'
## The following objects are masked from 'package:dlookr':
## 
##     kurtosis, skewness
library(SparseM)        # Provides some basic R functionality for linear algebra with sparse matrices
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
library(Metrics)        # An implementation of evaluation metrics in R that are commonly used in supervised machine learning
## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
## 
##     precision, recall
library(randomForest)   # Classification and regression based on a forest of trees using random inputs 
library(jtools)         # This is a collection of tools for more efficiently understanding and sharing the results of (primarily) regression analyses 
library(xgboost)        # The package includes efficient linear model solver and tree learning algorithms
## Warning: package 'xgboost' was built under R version 4.3.1
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
library(DiagrammeR)     # Build graph/network structures using functions for stepwise addition and deletion of nodes and edges 
## Warning: package 'DiagrammeR' was built under R version 4.3.1
library(effects)        # Graphical and tabular effect displays, e.g., of interactions, for various statistical models with linear predictors
## Loading required package: carData
## Registered S3 method overwritten by 'survey':
##   method      from  
##   summary.pps dlookr
## Use the command
##     lattice::trellis.par.set(effectsTheme())
##   to customize lattice options for effects plots.
## See ?effectsTheme for details.
library(shinyjs)
## 
## Attaching package: 'shinyjs'
## The following object is masked from 'package:Matrix':
## 
##     show
## The following object is masked from 'package:lmtest':
## 
##     reset
## The following object is masked from 'package:VGAM':
## 
##     show
## The following object is masked from 'package:stats4':
## 
##     show
## The following objects are masked from 'package:methods':
## 
##     removeClass, show
library(sp)
## Warning: package 'sp' was built under R version 4.3.1
#library(geoR)
library(gstat)
library(caret)

Let's load the required dataset

columbus <- st_read(system.file("shapes/columbus.shp", package="spData")[1])
## Reading layer `columbus' from data source 
##   `/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/spData/shapes/columbus.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 49 features and 20 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 5.874907 ymin: 10.78863 xmax: 11.28742 ymax: 14.74245
## CRS:           NA
col.gal.nb  <- read.gal(system.file("etc/weights/columbus.gal", package="spdep"))
columbus_sf <- read_sf(system.file("etc/shapes/columbus.shp", package="spdep"))

Let's visualize the distribution of the variables across Columbus, Ohio

Map of Columbus, Ohio

tm_shape(columbus_sf) + tm_polygons(col='wheat') + 
  tm_style("classic") + 
  tm_text(text='POLYID',size=0.7)
## Warning: Currect projection of shape columbus_sf unknown. Long-lat (WGS84) is
## assumed.

Mapping the main variable of interest (HOVAL: housing value in $1,000)

map option # 1

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(columbus_sf) + 
  tm_fill("HOVAL", style="quantile", title = "House Prices (Quantile)") + 
  tm_layout(main.title = "Columbus, Ohio", legend.position = c("left", "top"), 
            legend.title.size = 0.8, legend.text.size = 0.7)
## Warning: Currect projection of shape columbus_sf unknown. Long-lat (WGS84) is
## assumed.
## legend.postion is used for plot mode. Use view.legend.position in tm_view to set the legend position in view mode.

map option # 2

ggplot(data = columbus_sf) +
  geom_sf(aes(fill = HOVAL)) +
  ggtitle(label = "Columbus, Ohio", subtitle = "House Prices in $1,000")

Mapping some explanatory variables

tmap_mode("plot")
## tmap mode set to plotting

Take a look at the color palettes available for displaying a map

tmaptools::palette_explorer()
## PhantomJS not found. You can install it with webshot::install_phantomjs(). If it is installed, please make sure the phantomjs executable can be found via the PATH variable.
income_map <- tm_shape(columbus_sf) + 
  tm_fill("INC", palette = "Blues", style = "quantile", title = "Income") +
  tm_borders(alpha=.4) + tm_layout(legend.text.size = 0.8, legend.title.size = 1.1, frame = FALSE)

distance_map <- tm_shape(columbus_sf) + 
  tm_fill("DISCBD", palette = "BuPu", style = "quantile", title = "Distance to CBD") +
  tm_borders(alpha=.4) + tm_layout(legend.text.size = 0.8, legend.title.size = 1.1, frame = FALSE)

tmap_arrange(income_map,distance_map,nrow=1)
## Warning: Currect projection of shape columbus_sf unknown. Long-lat (WGS84) is
## assumed.

## Warning: Currect projection of shape columbus_sf unknown. Long-lat (WGS84) is
## assumed.

To estimate a spatial regression model, we first need to build a spatial weights matrix that connects the neighborhoods across Columbus, Ohio

map_centroid <- st_coordinates(st_centroid(st_geometry(columbus)))   # polygon centroids; coordinates() does not work on sf objects
map.linkW    <- nb2listw(col.gal.nb, style="W")                      # row-standardized spatial weights
# Plot only the geometry so that the attribute columns are not drawn
plot(st_geometry(columbus), border="blue", axes=FALSE, main="Columbus Ohio - Spatial Connectivity Matrix")
plot(st_geometry(columbus), col="grey", border=grey(0.9), add=TRUE)
plot(map.linkW, coords=map_centroid, pch=19, cex=0.1, col="red", add=TRUE)   # neighbor links between centroids
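
To make the idea of a spatial weights matrix concrete, here is a minimal sketch on a made-up four-region map. The adjacency matrix and the values in `x` are hypothetical, not the Columbus data. It row-standardizes the matrix, which is what `style="W"` requests in `nb2listw()`, and then computes the spatial lag, i.e. the average value among each region's neighbors.

```r
# Hypothetical 4-region adjacency (1 = neighbors); not the Columbus weights
adj <- matrix(c(0, 1, 1, 0,
                1, 0, 1, 1,
                1, 1, 0, 1,
                0, 1, 1, 0), nrow = 4, byrow = TRUE)

W <- adj / rowSums(adj)            # row-standardize: each row sums to 1
x <- c(10, 20, 30, 40)             # hypothetical values observed in the 4 regions
spatial_lag <- as.vector(W %*% x)  # average of the neighbors' values per region
```

Region 1 has neighbors 2 and 3, so its spatial lag is the mean of 20 and 30.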

Is a spatial regression model required?

What is the global Moran’s index, and how is it interpreted?

moran.test(columbus$HOVAL, listw = map.linkW, zero.policy = TRUE, na.action = na.omit)     
## 
##  Moran I test under randomisation
## 
## data:  columbus$HOVAL  
## weights: map.linkW    
## 
## Moran I statistic standard deviate = 2.1001, p-value = 0.01786
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.173645208      -0.020833333       0.008575953
# H0: HOVAL is randomly distributed across space (no spatial autocorrelation)
# Ha: HOVAL shows positive spatial autocorrelation (similar values cluster across space)
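
For reference, the global Moran's I is usually written as below. This is the standard textbook form rather than notation defined in this tutorial: $w_{ij}$ are the spatial weights (here the row-standardized map.linkW), $x_i$ is HOVAL in neighborhood $i$, and $n$ is the number of neighborhoods.

```latex
I = \frac{n}{S_0}\,
    \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i-\bar{x})(x_j-\bar{x})}
         {\sum_{i=1}^{n}(x_i-\bar{x})^2},
\qquad S_0=\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij},
\qquad \mathbb{E}[I] = -\frac{1}{n-1}
```

With $n = 49$ the expected value under spatial randomness is $-1/48 \approx -0.0208$, which matches the Expectation in the output. The observed statistic of 0.174 with a p-value of 0.018 rejects the null hypothesis at the 5% level, pointing to modest positive spatial autocorrelation in HOVAL.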

CROSS-VALIDATION DATASET

What is cross-validation? It is a statistical method to evaluate and compare learning algorithms by dividing data into two segments: one used to learn or train a model and the other used to validate or test the model (Refaeilzadeh, Tang, and Liu, 2009).

columbus_data <- st_drop_geometry(columbus)

Let's split the data into training and test sets. The training set is used to build the model and the test set to evaluate its predictive accuracy.

set.seed(123) # set.seed() fixes the random number generator so the split below is reproducible each time the script runs
partition <- createDataPartition(y = columbus_data$INC, p = 0.7, list = FALSE)
train <- columbus_data[partition, ]
test  <- columbus_data[-partition, ]
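
The split above is a single hold-out partition. As a hedged sketch of how the same idea extends to k-fold cross-validation, the snippet below rotates the test role across five folds. It uses base R only and the built-in mtcars data (an assumption of this example, not the Columbus data), and the names `k`, `folds`, and `fold_rmse` are hypothetical.

```r
# Illustration on the built-in mtcars data (not the Columbus data)
set.seed(123)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold label per row

# For each fold: train on the other folds, test on the held-out fold
fold_rmse <- sapply(1:k, function(i) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])
  pred <- predict(fit, newdata = mtcars[folds == i, ])
  sqrt(mean((mtcars$mpg[folds == i] - pred)^2))
})

cv_rmse <- mean(fold_rmse)  # average out-of-sample RMSE across folds
```

With only 49 neighborhoods, averaging over several folds gives a steadier error estimate than one 70/30 split.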

OLS

1) OLS stands for Ordinary Least Squares. OLS is a technique for estimating the coefficients of a linear regression.

2) OLS estimation consists of minimizing the sum of squared differences between observed and predicted values.

3) In other words, OLS aims to minimize the prediction error between the predicted and the observed values.
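
The three points above can be written compactly as follows. The notation is a standard one introduced here for illustration: $y_i$ is the observed value, $x_i$ the vector of explanatory variables for observation $i$, and $X$ the matrix stacking all $x_i$. The closed form assumes that $X^{\top}X$ is invertible.

```latex
\hat{\beta}
  = \arg\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^{\top}\beta\right)^2
  = (X^{\top}X)^{-1}X^{\top}y
```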

ols_model <- lm(HOVAL ~ INC + CRIME + OPEN + PLUMB + DISCBD + EW, data = columbus_data)
summary(ols_model)
## 
## Call:
## lm(formula = HOVAL ~ INC + CRIME + OPEN + PLUMB + DISCBD + EW, 
##     data = columbus_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.528  -7.594  -3.516   4.516  54.171 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  33.3415    15.7111   2.122   0.0398 *
## INC           0.1983     0.5413   0.366   0.7159  
## CRIME        -0.4842     0.2127  -2.276   0.0280 *
## OPEN          0.5697     0.4654   1.224   0.2278  
## PLUMB         1.7626     0.7405   2.380   0.0219 *
## DISCBD        4.1607     2.4393   1.706   0.0954 .
## EW            2.7720     4.6952   0.590   0.5581  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.4 on 42 degrees of freedom
## Multiple R-squared:  0.468,  Adjusted R-squared:  0.392 
## F-statistic: 6.157 on 6 and 42 DF,  p-value: 0.0001079
log_ols_model <- lm(log(HOVAL) ~ log(INC) + log(CRIME) + log(OPEN +0.01) + log(PLUMB) + log(DISCBD) + EW, data = columbus_data)
summary(log_ols_model)
## 
## Call:
## lm(formula = log(HOVAL) ~ log(INC) + log(CRIME) + log(OPEN + 
##     0.01) + log(PLUMB) + log(DISCBD) + EW, data = columbus_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.43202 -0.18759 -0.04296  0.11548  0.92501 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.079652   0.532676   3.904 0.000337 ***
## log(INC)          0.600896   0.164719   3.648 0.000724 ***
## log(CRIME)       -0.147720   0.044053  -3.353 0.001700 ** 
## log(OPEN + 0.01)  0.005243   0.020268   0.259 0.797142    
## log(PLUMB)        0.245902   0.076432   3.217 0.002494 ** 
## log(DISCBD)       0.426722   0.125104   3.411 0.001442 ** 
## EW                0.032330   0.097511   0.332 0.741872    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3017 on 42 degrees of freedom
## Multiple R-squared:  0.5741, Adjusted R-squared:  0.5133 
## F-statistic: 9.436 on 6 and 42 DF,  p-value: 1.465e-06
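AIC(ols_model)      # AIC = 408.89
## [1] 408.8867
AIC(log_ols_model)  # AIC = 30.07 (computed on the log(HOVAL) scale, so not directly comparable with the AIC above)
## [1] 30.06762
RMSE_ols_model     <- sqrt(mean(ols_model$residuals^2))      # in HOVAL units ($1,000)
RMSE_log_ols_model <- sqrt(mean(log_ols_model$residuals^2))  # in log(HOVAL) units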
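
The two RMSE values above are measured in different units (HOVAL versus log(HOVAL)), so they cannot be compared directly. A minimal sketch of one way to put them on the same scale is to back-transform the log-model fitted values with `exp()` before computing the error. The example uses the built-in mtcars data and a hypothetical helper `rmse_fn()`, neither of which is part of the original analysis. Note that `exp()` of a log-scale fit is a simple back-transformation and ignores retransformation bias.

```r
# Illustration on the built-in mtcars data (not the Columbus data)
fit_level <- lm(mpg ~ wt + hp, data = mtcars)
fit_log   <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Hypothetical helper: root mean squared error
rmse_fn <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

rmse_level      <- rmse_fn(mtcars$mpg, fitted(fit_level))       # mpg units
rmse_log_scale  <- rmse_fn(log(mtcars$mpg), fitted(fit_log))    # log units, not comparable
rmse_log_backtr <- rmse_fn(mtcars$mpg, exp(fitted(fit_log)))    # back in mpg units, comparable
```

Only `rmse_level` and `rmse_log_backtr` are in the same units and can be compared.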

Spatial Distribution of Regression Residuals

columbus$reg_residuals <- log_ols_model$residuals
columbus$fitted        <- exp(log_ols_model$fitted.values)

### summary(columbus)

map_residuals <- tm_shape(columbus) + 
  tm_fill("reg_residuals", palette = "PuRd", style = "quantile", title = "log OLS Residuals") +
  tm_borders(alpha=.4) + tm_layout(legend.text.size = 0.8, legend.title.size = 1.1, frame = FALSE)

### Observed vs Predicted Values 
tmap_mode("plot")
## tmap mode set to plotting
observed <- tm_shape(columbus) + 
  tm_fill("HOVAL", palette = "Oranges", style = "quantile", title = "HOVAL") +
  tm_borders(alpha=.4) + tm_layout(legend.text.size = 0.8, legend.title.size = 1.1, frame = FALSE)

fitted <- tm_shape(columbus) + 
  tm_fill("fitted", palette = "Oranges", style = "quantile", title = "Fitted HOVAL") +
  tm_borders(alpha=.4) + tm_layout(legend.text.size = 0.8, legend.title.size = 1.1, frame = FALSE)

tmap_arrange(observed,fitted,nrow=1)
## Warning: Currect projection of shape columbus unknown. Long-lat (WGS84) is
## assumed.

## Warning: Currect projection of shape columbus unknown. Long-lat (WGS84) is
## assumed.

SAR

1) SAR stands for Spatial Autoregressive Model. SAR is a spatial model specification that includes the spatial lag of the dependent variable as an explanatory variable.

2) If the dependent variable displays clustering of similar or dissimilar values across the geographic units of analysis, the model should include the spatial lag of the dependent variable.

3) Including the spatial lag of the dependent variable might significantly improve the estimated regression results while also improving model accuracy.
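
In equation form, the SAR specification described above is commonly written as below. The symbols are standard notation introduced here for illustration: $y$ is the dependent variable (log(HOVAL) in the model that follows), $W$ the row-standardized spatial weights matrix, $Wy$ the spatial lag of $y$, and $\rho$ the spatial autoregressive coefficient reported as Rho by `lagsarlm()`.

```latex
y = \rho W y + X\beta + \varepsilon,
\qquad \varepsilon \sim N(0, \sigma^2 I_n)
```

When $\rho = 0$ the model reduces to the OLS regression.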

sar_model <- lagsarlm(log(HOVAL) ~ log(INC) + log(CRIME) + log(OPEN +0.01) + log(PLUMB) + log(DISCBD) + EW, data=columbus, map.linkW, method="Matrix")
## Warning in .local(x, logarithm, ...): the default value of argument 'sqrt' of
## method 'determinant(<CHMfactor>, <logical>)' may change from TRUE to FALSE as
## soon as the next release of Matrix; set 'sqrt' when programming
# The model is fitted on log(HOVAL), so back-transform the fitted values to the HOVAL scale
columbus$predicted <- exp(as.numeric(predict(sar_model)))
## This method assumes the response is known - see manual page
# Calculate the residuals on the HOVAL scale ($1,000)
columbus$residuals <- columbus$HOVAL - columbus$predicted

# Calculate RMSE
RMSE_SAR <- sqrt(mean(columbus$residuals^2))

summary(sar_model)
## 
## Call:lagsarlm(formula = log(HOVAL) ~ log(INC) + log(CRIME) + log(OPEN + 
##     0.01) + log(PLUMB) + log(DISCBD) + EW, data = columbus, listw = map.linkW, 
##     method = "Matrix")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.415234 -0.196420 -0.035814  0.112857  0.912480 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                    Estimate Std. Error z value  Pr(>|z|)
## (Intercept)       1.9135339  0.6812155  2.8090 0.0049696
## log(INC)          0.5818087  0.1549324  3.7552 0.0001732
## log(CRIME)       -0.1485391  0.0407521 -3.6449 0.0002675
## log(OPEN + 0.01)  0.0061916  0.0187851  0.3296 0.7417020
## log(PLUMB)        0.2367958  0.0717860  3.2986 0.0009716
## log(DISCBD)       0.4016667  0.1309609  3.0671 0.0021617
## EW                0.0306734  0.0901117  0.3404 0.7335605
## 
## Rho: 0.068092, LR test value: 0.14613, p-value: 0.70226
## Asymptotic standard error: 0.17048
##     z-value: 0.39941, p-value: 0.68959
## Wald statistic: 0.15953, p-value: 0.68959
## 
## Log likelihood: -6.960748 for lag model
## ML residual variance (sigma squared): 0.077707, (sigma: 0.27876)
## Number of observations: 49 
## Number of parameters estimated: 9 
## AIC: 31.921, (AIC for lm: 30.068)
## LM test for residual autocorrelation
## test value: 3.076, p-value: 0.079455
#RMSE_SAR <- sqrt(mean((columbus - sar_model)^2))

SEM

1) SEM stands for Spatial Error Model. SEM is a spatial model specification that includes the spatial lag of the error term (regression residuals), with the aim of detecting and correcting misspecification of the regression model.

2) SEM is a useful specification when the estimated residuals of the baseline model display clustering or agglomeration across the geographic units of analysis.

3) If the estimated residuals display clustering or agglomeration across the geographic units of analysis, the regression model is misspecified.
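
In equation form, the SEM specification described above is commonly written as below. The symbols are standard notation introduced here for illustration: $u$ is the spatially correlated error term, $W$ the row-standardized spatial weights matrix, and $\lambda$ the spatial error coefficient reported as Lambda by `errorsarlm()`.

```latex
y = X\beta + u,
\qquad u = \lambda W u + \varepsilon,
\qquad \varepsilon \sim N(0, \sigma^2 I_n)
```

When $\lambda = 0$ the errors are spatially independent and the model reduces to the OLS regression.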

sem_model <- errorsarlm(log(HOVAL) ~ log(INC) + log(CRIME) + log(OPEN +0.01) + log(PLUMB) + log(DISCBD) + EW, data=columbus, map.linkW, method="Matrix")
# The model is fitted on log(HOVAL), so back-transform the fitted values of the SEM model to the HOVAL scale
columbus$predicted <- exp(as.numeric(predict(sem_model)))
## This method assumes the response is known - see manual page
# Calculate the residuals on the HOVAL scale ($1,000)
columbus$residuals <- columbus$HOVAL - columbus$predicted

# Calculate RMSE
RMSE_SEM <- sqrt(mean(columbus$residuals^2))

summary(sem_model)
## 
## Call:errorsarlm(formula = log(HOVAL) ~ log(INC) + log(CRIME) + log(OPEN + 
##     0.01) + log(PLUMB) + log(DISCBD) + EW, data = columbus, listw = map.linkW, 
##     method = "Matrix")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.464112 -0.167533 -0.056421  0.096777  0.921712 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                    Estimate Std. Error z value  Pr(>|z|)
## (Intercept)       1.9574013  0.4827496  4.0547 5.020e-05
## log(INC)          0.6436862  0.1472091  4.3726 1.228e-05
## log(CRIME)       -0.1505657  0.0411514 -3.6588 0.0002534
## log(OPEN + 0.01)  0.0039052  0.0188312  0.2074 0.8357152
## log(PLUMB)        0.2685290  0.0670014  4.0078 6.128e-05
## log(DISCBD)       0.4389545  0.1079582  4.0660 4.783e-05
## EW                0.0448953  0.0785217  0.5718 0.5674873
## 
## Lambda: -0.19851, LR test value: 0.65105, p-value: 0.41974
## Asymptotic standard error: 0.21605
##     z-value: -0.91884, p-value: 0.35818
## Wald statistic: 0.84426, p-value: 0.35818
## 
## Log likelihood: -6.708288 for error model
## ML residual variance (sigma squared): 0.076342, (sigma: 0.2763)
## Number of observations: 49 
## Number of parameters estimated: 9 
## AIC: 31.417, (AIC for lm: 30.068)
#RMSE_SAR <- sqrt(mean((columbus_data - sem_model)^2))

XGBoost Regression

NOTE: XGBoost does not handle transformations such as log() and I()^2 written inside a regression formula, so we transform the variables beforehand and store them directly in the dataset.

columbus_data_alt <- columbus_data %>% select(HOVAL, INC, CRIME, OPEN, PLUMB, DISCBD, EW)

columbus_data_alt$INC      <- log(columbus_data_alt$INC)
columbus_data_alt$CRIME    <- log(columbus_data_alt$CRIME)
columbus_data_alt$OPEN     <- ((columbus_data_alt$OPEN) + 0.01)
columbus_data_alt$OPEN     <- log(columbus_data_alt$OPEN)
columbus_data_alt$PLUMB    <- log(columbus_data_alt$PLUMB)
columbus_data_alt$DISCBD   <- log(columbus_data_alt$DISCBD)

summary(columbus_data_alt)
##      HOVAL            INC            CRIME             OPEN         
##  Min.   :17.90   Min.   :1.499   Min.   :-1.724   Min.   :-4.60517  
##  1st Qu.:25.70   1st Qu.:2.299   1st Qu.: 2.998   1st Qu.:-1.30998  
##  Median :33.50   Median :2.594   Median : 3.526   Median : 0.01599  
##  Mean   :38.44   Mean   :2.591   Mean   : 3.297   Mean   :-0.54135  
##  3rd Qu.:43.30   3rd Qu.:2.908   3rd Qu.: 3.883   3rd Qu.: 1.37281  
##  Max.   :96.40   Max.   :3.436   Max.   : 4.233   Max.   : 3.21920  
##      PLUMB              DISCBD              EW        
##  Min.   :-2.01934   Min.   :-0.9943   Min.   :0.0000  
##  1st Qu.:-1.10157   1st Qu.: 0.5306   1st Qu.:0.0000  
##  Median : 0.02361   Median : 0.9821   Median :1.0000  
##  Mean   : 0.03361   Mean   : 0.8864   Mean   :0.5918  
##  3rd Qu.: 0.92991   3rd Qu.: 1.3584   3rd Qu.:1.0000  
##  Max.   : 2.93445   Max.   : 1.7174   Max.   :1.0000
set.seed(123) # set.seed() fixes the random number generator so the split below is reproducible each time the script runs
cv_data  <- createDataPartition(y = columbus_data_alt$INC, p = 0.7, list = FALSE)
cv_train <- columbus_data_alt[cv_data, ]
cv_test  <- columbus_data_alt[-cv_data, ]

Define the explanatory variables (X’s) and the dependent variable (Y) in the training set

train_x = data.matrix(cv_train[, -1])
train_y = cv_train[,1]

Define the explanatory variables (X’s) and the dependent variable (Y) in the testing set

test_x = data.matrix(cv_test[, -1])
test_y = cv_test[, 1]

Define the final training and testing sets

xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test  = xgb.DMatrix(data = test_x, label = test_y)

Let's fit the XGBoost regression model and display the RMSE for both training and testing data at each round

watchlist = list(train=xgb_train, test=xgb_test)
model_xgb = xgb.train(data=xgb_train, max.depth=3, watchlist=watchlist, nrounds=70) # the larger the number of rounds, the longer it takes to display the results
## [1]  train-rmse:32.558935    test-rmse:28.951619 
## [2]  train-rmse:25.355109    test-rmse:22.671456 
## [3]  train-rmse:20.152249    test-rmse:18.730544 
## [4]  train-rmse:16.339010    test-rmse:16.795286 
## [5]  train-rmse:13.500239    test-rmse:16.129408 
## [6]  train-rmse:11.332169    test-rmse:16.509411 
## [7]  train-rmse:9.637591 test-rmse:17.161065 
## [8]  train-rmse:8.354674 test-rmse:17.943420 
## [9]  train-rmse:7.301888 test-rmse:18.285487 
## [10] train-rmse:6.493047 test-rmse:18.437669 
## [11] train-rmse:5.783834 test-rmse:19.281797 
## [12] train-rmse:5.210777 test-rmse:20.024676 
## [13] train-rmse:4.766990 test-rmse:20.409095 
## [14] train-rmse:4.216826 test-rmse:20.960376 
## [15] train-rmse:3.892447 test-rmse:21.462003 
## [16] train-rmse:3.594895 test-rmse:21.755089 
## [17] train-rmse:3.117573 test-rmse:22.122776 
## [18] train-rmse:2.912595 test-rmse:22.249679 
## [19] train-rmse:2.744708 test-rmse:22.289308 
## [20] train-rmse:2.528766 test-rmse:22.453613 
## [21] train-rmse:2.428361 test-rmse:22.469520 
## [22] train-rmse:2.245889 test-rmse:22.488167 
## [23] train-rmse:1.995308 test-rmse:22.695846 
## [24] train-rmse:1.892313 test-rmse:22.762292 
## [25] train-rmse:1.706430 test-rmse:22.803818 
## [26] train-rmse:1.530626 test-rmse:22.917683 
## [27] train-rmse:1.462334 test-rmse:22.956625 
## [28] train-rmse:1.307067 test-rmse:22.853381 
## [29] train-rmse:1.195997 test-rmse:22.904000 
## [30] train-rmse:1.114445 test-rmse:22.882855 
## [31] train-rmse:1.033702 test-rmse:22.894206 
## [32] train-rmse:0.930637 test-rmse:22.885834 
## [33] train-rmse:0.855417 test-rmse:22.954493 
## [34] train-rmse:0.777327 test-rmse:22.989968 
## [35] train-rmse:0.694897 test-rmse:23.009517 
## [36] train-rmse:0.640067 test-rmse:23.024204 
## [37] train-rmse:0.585913 test-rmse:23.046855 
## [38] train-rmse:0.544947 test-rmse:23.103338 
## [39] train-rmse:0.496445 test-rmse:23.130836 
## [40] train-rmse:0.445176 test-rmse:23.174879 
## [41] train-rmse:0.423205 test-rmse:23.195423 
## [42] train-rmse:0.394193 test-rmse:23.194732 
## [43] train-rmse:0.361441 test-rmse:23.219765 
## [44] train-rmse:0.338051 test-rmse:23.241007 
## [45] train-rmse:0.303142 test-rmse:23.258915 
## [46] train-rmse:0.285972 test-rmse:23.256439 
## [47] train-rmse:0.265118 test-rmse:23.277356 
## [48] train-rmse:0.239570 test-rmse:23.290082 
## [49] train-rmse:0.223512 test-rmse:23.306106 
## [50] train-rmse:0.202481 test-rmse:23.314676 
## [51] train-rmse:0.192101 test-rmse:23.326943 
## [52] train-rmse:0.180221 test-rmse:23.315882 
## [53] train-rmse:0.168303 test-rmse:23.313811 
## [54] train-rmse:0.152294 test-rmse:23.322716 
## [55] train-rmse:0.139646 test-rmse:23.332230 
## [56] train-rmse:0.128065 test-rmse:23.340492 
## [57] train-rmse:0.118687 test-rmse:23.349151 
## [58] train-rmse:0.112001 test-rmse:23.356265 
## [59] train-rmse:0.103464 test-rmse:23.349631 
## [60] train-rmse:0.098326 test-rmse:23.355807 
## [61] train-rmse:0.088950 test-rmse:23.361121 
## [62] train-rmse:0.083060 test-rmse:23.356842 
## [63] train-rmse:0.079157 test-rmse:23.355080 
## [64] train-rmse:0.072423 test-rmse:23.357653 
## [65] train-rmse:0.069043 test-rmse:23.357478 
## [66] train-rmse:0.063032 test-rmse:23.356953 
## [67] train-rmse:0.059618 test-rmse:23.358380 
## [68] train-rmse:0.055434 test-rmse:23.354991 
## [69] train-rmse:0.052059 test-rmse:23.352430 
## [70] train-rmse:0.047479 test-rmse:23.356072
# The test RMSE is lowest at round 5 (16.13) and rises afterwards while the training RMSE keeps falling, a sign of overfitting.
# Let's estimate the final regression model. It keeps the original choice of nrounds = 59; per the log above, nrounds = 5 would give the lowest test RMSE.
reg_xgb = xgboost(data = xgb_train, max.depth = 3, nrounds = 59, verbose = 0) # verbose = 0 suppresses the per-round training error
prediction_xgb_test<-predict(reg_xgb, xgb_test)
rmse(prediction_xgb_test, cv_test$HOVAL)
## [1] 23.34963
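
As a minimal sketch of how the number of rounds could be chosen from the training log instead of by eye, the snippet below takes the test RMSE of the first ten rounds, copied from the log above, and locates the minimum with `which.min()`. The vector name `test_rmse` is hypothetical. In practice the same idea can be automated with the `early_stopping_rounds` argument of `xgb.train()`, which stops training once the watched metric no longer improves.

```r
# Test RMSE for rounds 1 to 10, copied from the training log above
test_rmse <- c(28.951619, 22.671456, 18.730544, 16.795286, 16.129408,
               16.509411, 17.161065, 17.943420, 18.285487, 18.437669)

best_round <- which.min(test_rmse)     # round with the lowest test RMSE
best_rmse  <- test_rmse[best_round]    # the corresponding test RMSE
```

Because the test set is used here to pick the number of rounds, the resulting RMSE is an optimistic estimate. A separate validation split or cross-validation would give a cleaner assessment.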
# Let's run a diagnostic check of the regression residuals
xgb_reg_residuals<-cv_test$HOVAL - prediction_xgb_test
plot(xgb_reg_residuals, xlab= "Test observation index", ylab = "Residuals", main = 'XGBoost Regression Residuals')
abline(0,0)

# Plot first 3 trees of model
xgb.plot.tree(model=reg_xgb, trees=0:2)
importance_matrix <- xgb.importance(model = reg_xgb)
xgb.plot.importance(importance_matrix, xlab = "Explanatory Variables X's Importance")

Suggested Readings

1) Understanding Linear Regression Output in R

2) How to run and interpret simple regression models in R

3) What is Supervised Learning?

4) Supervised vs. Unsupervised Learning

5) Maps of Meaning: Why You Need to Study Spatial Statistics as a Data Scientist

Exploratory Data Analysis

  1. Data Visualization: Creating maps with tm_shape and ggplot to visualize house prices (HOVAL) in Columbus, Ohio makes it possible to identify spatial patterns and areas of interest. The warnings about the unknown projection suggest that the projection of the data should be specified for accurate spatial analysis.

  2. Preliminary Analysis: The initial exploration shows substantial variability in housing prices (HOVAL ranges from 17.9 to 96.4, in $1,000), which is crucial for understanding the dynamics of the real estate market across neighborhoods.

Diagnostic Tests

  1. Coordinate Errors: The coordinates function is not compatible with sf objects, so extracting coordinates with it fails. Use st_coordinates for sf objects instead.

  2. Model Tests: The AIC values for ols_model (408.89) and log_ols_model (30.07) suggest that the logarithmic model provides a better fit, which points to the importance of log transformations for some variables. Note that the two AIC values are computed on different response scales (HOVAL versus log(HOVAL)), so the comparison is only indicative.

Main Insights

  1. Influence of Variables: The spatial regressions show that income and crime have significant effects on housing prices. The importance of these variables highlights the role of socioeconomic conditions and safety in property valuation.

  2. Spatial Modeling: The spatial models, SAR and SEM, test for spatial autocorrelation in the data. Rho in the SAR model (0.068) and Lambda in the SEM model (-0.199) are not statistically significant (p-value > 0.05), so they point at most to mild spatial dependence, which should be examined in more detailed analyses.

  3. Prediction and Residuals: Predicting housing prices with XGBoost and evaluating the residuals highlights the model's ability to capture variability in the data, although the rise in test RMSE with more training rounds suggests overfitting.

---
title: "Actividad Sugerida 1"
author: "Genaro Rodríguez Alcántara - A00833172"
date: "2024-02-28"
output: 
  html_document:
    toc: TRUE
    toc_float: TRUE
    code_download: TRUE
---
####################################
# SUPERVISED MACHINE LEARNING
####################################

# What is Supervised Machine Learning? 
#### 1) It is the use of labeled datasets to train algorithms to either classify data or predict outcomes accurately (IBM).   
#### 2) SML uses a training dataset to teach models to yield a desired output. The training dataset allows the model to learn over time.    
#### 3) SML problems are separated into two types: regression (predicting a numeric outcome) and classification (predicting a category).   

#### References: What is Supervised Machine Learning? 
#### IBM. Source: https://www.ibm.com/topics/supervised-learning 

###############
# Dataset
###############

#### 1) The Columbus Ohio Spatial Analysis Dataset is a dataframe of 49 rows and 22 columns. Each row corresponds to one of the 49 neighborhoods in Columbus, Ohio. 
#### 2) It is a real estate dataset which focuses on predicting the housing value. 
#### 3) The loaded libraries include the required shapefile (electronic map) to visualize data patterns across Columbus, OH. 
#### 4) Variables' description can be found in the following link -> https://search.r-project.org/CRAN/refmans/RgoogleMaps/html/columbus.html 

#### Let's load the required libraries

#### data manipulation & data visualization
```{r}
library(foreign)        # Read Data Stored by 'Minitab', 'S', 'SAS', 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase'
library(ggplot2)        # It is a system for creating graphics
library(dplyr)          # A fast, consistent tool for working with data frame like objects
library(mapview)        # Quickly and conveniently create interactive visualizations of spatial data with or without background maps
library(naniar)         # Provides data structures and functions that facilitate the plotting of missing values and examination of imputations.
#library(maptools)       # A collection of functions to create spatial weights matrix objects from polygon 'contiguities', for summarizing these objects, and for permitting their use in spatial data analysis
library(tmap)           # For drawing thematic maps
library(RColorBrewer)   # It offers several color palettes 
library(dlookr)         # A collection of tools that support data diagnosis, exploration, and transformation
```


```{r}
# predictive modeling
library(regclass)       # Contains basic tools for visualizing, interpreting, and building regression models
library(mctest)         # Multicollinearity diagnostics
library(lmtest)         # Testing linear regression models
library(spdep)          # A collection of functions to create spatial weights matrix objects from polygon 'contiguities', for summarizing these objects, and for permitting their use in spatial data analysis
library(sf)             # A standardized way to encode spatial vector data
library(spData)         # Diverse spatial datasets for demonstrating, benchmarking and teaching spatial data analysis
library(spatialreg)     # A collection of all the estimation functions for spatial cross-sectional models
library(caret)          # The caret package (short for Classification And Regression Training) contains functions to streamline the model training process for complex regression and classification problems.
library(e1071)          # Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier, generalized k-nearest neighbor.
library(SparseM)        # Provides some basic R functionality for linear algebra with sparse matrices
library(Metrics)        # An implementation of evaluation metrics in R that are commonly used in supervised machine learning
library(randomForest)   # Classification and regression based on a forest of trees using random inputs 
library(jtools)         # This is a collection of tools for more efficiently understanding and sharing the results of (primarily) regression analyses 
library(xgboost)        # The package includes efficient linear model solver and tree learning algorithms
library(DiagrammeR)     # Build graph/network structures using functions for stepwise addition and deletion of nodes and edges 
library(effects)        # Graphical and tabular effect displays, e.g., of interactions, for various statistical models with linear predictors
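library(shinyjs)        # JavaScript helpers for Shiny apps (used by tmaptools::palette_explorer())
library(sp)             # Classes and methods for spatial data
#library(geoR)
library(gstat)          # Spatial and spatio-temporal geostatistical modelling, prediction and simulation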
```

#### Let's load the required dataset
```{r}
columbus <- st_read(system.file("shapes/columbus.shp", package="spData")[1])
col.gal.nb  <- read.gal(system.file("etc/weights/columbus.gal", package="spdep"))
columbus_sf <- read_sf(system.file("etc/shapes/columbus.shp", package="spdep"))
```

#### Let's visualize the distribution of the variables across Columbus, Ohio

#### Map of Columbus, Ohio
```{r}
tm_shape(columbus_sf) + tm_polygons(col='wheat') + 
  tm_style("classic") + 
  tm_text(text='POLYID',size=0.7)

```

#### Mapping the main variable of interest (HOVAL: housing value in $1,000)

#### map option # 1
```{r}
tmap_mode("view")
tm_shape(columbus_sf) + 
  tm_fill("HOVAL", style="quantile", title = "House Prices (Quantile)") + 
  tm_layout(main.title = "Columbus, Ohio", legend.position = c("left", "top"), 
            legend.title.size = 0.8, legend.text.size = 0.7)
```

#### map option # 2
```{r}
ggplot(data = columbus_sf) +
  geom_sf(aes(fill = HOVAL)) +
  ggtitle(label = "Columbus, Ohio", subtitle = "House Prices in $1,000")
```


#### Mapping some explanatory variables 
```{r}
tmap_mode("plot")
```

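#### Take a look at the color palettes available for displaying a map
```{r, eval=FALSE}
# Launches an interactive Shiny app, so run it from the console rather than while knitting
tmaptools::palette_explorer()
```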

```{r}
income_map <- tm_shape(columbus_sf) + 
  tm_fill("INC", palette = "Blues", style = "quantile", title = "Income") +
  tm_borders(alpha=.4) + tm_layout(legend.text.size = 0.8, legend.title.size = 1.1, frame = FALSE)

distance_map <- tm_shape(columbus_sf) + 
  tm_fill("DISCBD", palette = "BuPu", style = "quantile", title = "Distance to CBD") +
  tm_borders(alpha=.4) + tm_layout(legend.text.size = 0.8, legend.title.size = 1.1, frame = FALSE)

tmap_arrange(income_map,distance_map,nrow=1)
```

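#### To estimate a spatial regression model, we first need to build a spatial weights matrix that connects the neighborhoods across Columbus, Ohio
```{r}
# coordinates() fails on sf objects; st_coordinates() on the polygon centroids is the sf equivalent
map_centroid <- st_coordinates(st_centroid(st_geometry(columbus)))
map.linkW    <- nb2listw(col.gal.nb, style="W")   
# plot only the geometry so that the sf plot method does not draw one panel per attribute
plot(st_geometry(columbus),border="blue",axes=FALSE,las=1, main="Columbus Ohio - Spatial Connectivity Matrix")
plot(st_geometry(columbus),col="grey",border=grey(0.9),axes=T,add=T) 
plot(map.linkW,coords=map_centroid,pch=19,cex=0.1,col="red",add=T) 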
```
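
To make the `style="W"` option above more concrete, here is a minimal base R sketch of row standardization. The four-region neighbour list is hypothetical (it is not the Columbus data), and the loop is only an illustration of the idea, not how `nb2listw()` is implemented internally.

```r
# Hypothetical 4-region example: each list element holds the indices
# of that region's neighbours (made up for illustration)
neighbours <- list(c(2L, 3L), c(1L, 3L, 4L), c(1L, 2L), 2L)

n <- length(neighbours)
W <- matrix(0, n, n)
for (i in seq_len(n)) {
  # every neighbour of region i gets the same weight, and the weights sum to 1
  W[i, neighbours[[i]]] <- 1 / length(neighbours[[i]])
}

W
rowSums(W)  # each row sums to 1 after row standardization
```

With row-standardized weights, multiplying `W` by a variable returns the average of that variable over each region's neighbours.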

#### Is a spatial regression model required? 
#### What is the global Moran's index, and how should it be interpreted? 
```{r}
moran.test(columbus$HOVAL, listw = map.linkW, zero.policy = TRUE, na.action = na.omit)     
# Ho: data are randomly distributed across space                                              
# Ha: clusters of data observations might be displayed across space
```
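
As an aid to interpreting the test above, the following sketch computes the global Moran's I by hand on a toy example. The helper name `morans_i`, the six-region chain, and the two value vectors are all assumptions made for illustration; `moran.test()` additionally reports the expectation, variance, and p-value, which this sketch does not.

```r
# Global Moran's I: (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i^2),
# where z are deviations from the mean and S0 is the sum of all weights
morans_i <- function(x, W) {
  z  <- x - mean(x)
  s0 <- sum(W)
  (length(x) / s0) * sum(W * outer(z, z)) / sum(z^2)
}

# Hypothetical chain of 6 regions: region i borders regions i - 1 and i + 1
n <- 6
A <- matrix(0, n, n)
A[cbind(1:(n - 1), 2:n)] <- 1
A <- A + t(A)
W <- A / rowSums(A)                      # row-standardized weights

clustered   <- c(1, 2, 3, 4, 5, 6)       # similar values sit next to each other
alternating <- c(1, 6, 1, 6, 1, 6)       # neighbours are always dissimilar

morans_i(clustered, W)     # positive: spatial clustering
morans_i(alternating, W)   # negative: a checkerboard-like pattern
```

Values near zero are what the null hypothesis of spatial randomness would produce, which is why a significantly positive statistic motivates a spatial model.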


##################################
# CROSS - VALIDATION DATASET
##################################

#### What is cross-validation? It is a statistical method to evaluate and compare learning algorithms by dividing data into two segments: 
#### one used to learn or train a model and the other used to validate or test the model 
#### (Refaeilzadeh, Tang, and Liu, 2009).  

```{r}
columbus_data <- st_drop_geometry(columbus)
```

Let's split the data into training and test sets: the training set is used to build the model and the test set to evaluate its predictive accuracy.
```{r}
set.seed(123) # set.seed() fixes the random number generator so the random partition is the same each time the script runs.
partition <- createDataPartition(y = columbus_data$INC, p=0.7, list=F)
train = columbus_data[partition, ]
test  = columbus_data[-partition, ]
```
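
The single 70/30 split above is one form of validation. A common alternative with only 49 rows is k-fold cross-validation, sketched below in base R. The data are simulated (not the Columbus data), and `k = 5` and the variable names are arbitrary choices for illustration.

```r
set.seed(123)

# Simulated data with the same number of rows as the Columbus dataset
n   <- 49
toy <- data.frame(x = runif(n))
toy$y <- 2 + 3 * toy$x + rnorm(n, sd = 0.5)

# Assign every row to one of k folds at random
k    <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Hold out each fold in turn, fit on the rest, and score on the held-out rows
fold_rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ x, data = toy[fold != i, ])
  pred <- predict(fit, newdata = toy[fold == i, ])
  sqrt(mean((toy$y[fold == i] - pred)^2))
})

fold_rmse        # one out-of-sample RMSE per fold
mean(fold_rmse)  # cross-validated RMSE
```

Every observation is used for testing exactly once, which makes the error estimate less dependent on a single random partition.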

###########
# OLS
###########

#### 1) OLS stands for Ordinary Least Squares. OLS is a technique for estimating the coefficients of a linear regression. 
#### 2) The OLS estimation technique consists of minimizing the sum of squared differences between observed and predicted values. 
#### 3) In other words, OLS aims to minimize the prediction error between the predicted and the observed values. 
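
The minimization described above has a closed-form solution, the normal equations. The sketch below solves them directly on simulated data (not the Columbus data) and compares the result with `lm()`; the variable names are illustrative only.

```r
set.seed(1)

# Simulated data: y depends linearly on x plus noise
x <- runif(30)
y <- 1 + 2 * x + rnorm(30, sd = 0.3)

# Design matrix with an intercept column
X <- cbind(1, x)

# Normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# The two columns agree up to numerical precision
cbind(normal_equations = drop(beta_hat), lm = coef(lm(y ~ x)))
```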

```{r}
ols_model <- lm(HOVAL ~ INC + CRIME + OPEN + PLUMB + DISCBD + EW, data = columbus_data)
summary(ols_model)

log_ols_model <- lm(log(HOVAL) ~ log(INC) + log(CRIME) + log(OPEN +0.01) + log(PLUMB) + log(DISCBD) + EW, data = columbus_data)
summary(log_ols_model)

AIC(ols_model)      # AIC = 317.48
AIC(log_ols_model)  # AIC = 28.85

RMSE_ols_model     <- sqrt(mean(ols_model$residuals^2))
RMSE_log_ols_model <- sqrt(mean(log_ols_model$residuals^2))
```
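
One caveat when reading the two AIC values above: a model for `HOVAL` and a model for `log(HOVAL)` have likelihoods on different scales. A common correction, sketched here on simulated data rather than the Columbus data, adds the Jacobian term `2 * sum(log(y))` to the AIC of the log model so both refer to the original response. All names below are illustrative assumptions.

```r
set.seed(42)

# Simulated positive response that is linear on the log scale
x <- runif(50, 1, 10)
y <- exp(0.5 + 0.2 * x + rnorm(50, sd = 0.3))

fit_level <- lm(y ~ x)        # response in levels
fit_log   <- lm(log(y) ~ x)   # response in logs

# Put the log model's AIC on the scale of y via the Jacobian of the transformation
aic_log_on_y_scale <- AIC(fit_log) + 2 * sum(log(y))

c(level        = AIC(fit_level),
  log_raw      = AIC(fit_log),
  log_adjusted = aic_log_on_y_scale)
```

Only `level` and `log_adjusted` should be compared with each other; `log_raw` looks much smaller simply because the response was rescaled.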

#################################################
# Spatial Distribution Regression Residuals
#################################################

```{r}
columbus$reg_residuals <- log_ols_model$residuals
columbus$fitted        <- exp(log_ols_model$fitted.values)

### summary(columbus)

map_residuals <- tm_shape(columbus) + 
  tm_fill("reg_residuals", palette = "PuRd", style = "quantile", title = "log OLS Residuals") +
  tm_borders(alpha=.4) + tm_layout(legend.text.size = 0.8, legend.title.size = 1.1, frame = FALSE)

### Observed vs Predicted Values 
tmap_mode("plot")

observed <- tm_shape(columbus) + 
  tm_fill("HOVAL", palette = "Oranges", style = "quantile", title = "HOVAL") +
  tm_borders(alpha=.4) + tm_layout(legend.text.size = 0.8, legend.title.size = 1.1, frame = FALSE)

fitted <- tm_shape(columbus) + 
  tm_fill("fitted", palette = "Oranges", style = "quantile", title = "Fitted HOVAL") +
  tm_borders(alpha=.4) + tm_layout(legend.text.size = 0.8, legend.title.size = 1.1, frame = FALSE)

tmap_arrange(observed,fitted,nrow=1)
```

###########
# SAR
###########

#### 1) SAR stands for Spatial Autoregressive Model. SAR is a spatial model specification which includes as an explanatory variable the spatial lag of the dependent variable. 
#### 2) If the dependent variable displays clustering of similar / dissimilar values across the geographic unit of analysis then it is required to specify the spatial lag of the dependent variable.  
#### 3) Including the spatial lag of the dependent variable might significantly improve the estimated regression results while also improving model accuracy. 
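
To illustrate what "the spatial lag of the dependent variable" means, the sketch below builds a lag by hand and then simulates from a SAR process. The five-region chain, the housing values, and the parameter values (`rho`, `beta`) are all made up for illustration and are unrelated to the estimates reported for Columbus.

```r
# Hypothetical chain of 5 regions: region i borders regions i - 1 and i + 1
n <- 5
A <- matrix(0, n, n)
A[cbind(1:(n - 1), 2:n)] <- 1
A <- A + t(A)
W <- A / rowSums(A)              # row-standardized weights

# Spatial lag: the average of the neighbours' values for each region
y_obs <- c(10, 12, 30, 28, 11)   # made-up housing values
lag_y <- drop(W %*% y_obs)
lag_y

# Simulate from the SAR reduced form y = (I - rho W)^{-1} (X beta + e)
set.seed(7)
rho  <- 0.4                      # assumed spatial autoregressive parameter
X    <- cbind(1, runif(n))
beta <- c(2, 1.5)
e    <- rnorm(n, sd = 0.1)
y_sim <- drop(solve(diag(n) - rho * W, X %*% beta + e))

# Check that y_sim satisfies y = rho * W y + X beta + e (difference is ~0)
max(abs(y_sim - (rho * drop(W %*% y_sim) + drop(X %*% beta) + e)))
```

Because `y` appears on both sides of the equation, a shock in one region spills over to its neighbours, which is why the model cannot be estimated consistently by plain OLS.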

```{r}
sar_model <- lagsarlm(log(HOVAL) ~ log(INC) + log(CRIME) + log(OPEN +0.01) + log(PLUMB) + log(DISCBD) + EW, data=columbus, map.linkW, method="Matrix")

# The response is log(HOVAL), so the in-sample predictions are on the log scale
columbus$predicted <- as.numeric(predict(sar_model))

# Calculate the residuals on the same (log) scale as the predictions
columbus$residuals <- log(columbus$HOVAL) - columbus$predicted

# Calculate RMSE (log scale, comparable to RMSE_log_ols_model)
RMSE_SAR <- sqrt(mean(columbus$residuals^2))

summary(sar_model)

```

###########
# SEM
############

#### 1) SEM stands for Spatial Error Model. SEM is a spatial model specification which includes the spatial lag of the error term (regression residuals) with the aim to confirm the misspecification of the regression model.
#### 2) SEM is a useful regression model specification when the estimated regression residuals of the baseline model display clustering / agglomeration across the geographic unit of analysis.  
#### 3) If the estimated regression model residuals display clustering / agglomeration across the geographic unit of analysis then there is a misspecification of the regression model. 
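
For reference, the two spatial specifications described in this document are commonly written as follows. This is standard textbook notation rather than something taken from the source: $W$ stands for the row-standardized weights matrix (here `map.linkW`), and $\rho$ and $\lambda$ correspond to the Rho and Lambda values reported by `summary()`.

$$
\begin{aligned}
\text{SAR:}\quad & y = \rho W y + X\beta + \varepsilon \\
\text{SEM:}\quad & y = X\beta + u, \qquad u = \lambda W u + \varepsilon
\end{aligned}
$$

Setting $\rho = 0$ in the SAR model or $\lambda = 0$ in the SEM model recovers the ordinary linear regression $y = X\beta + \varepsilon$.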

```{r}
sem_model <- errorsarlm(log(HOVAL) ~ log(INC) + log(CRIME) + log(OPEN +0.01) + log(PLUMB) + log(DISCBD) + EW, data=columbus, map.linkW, method="Matrix")
# The response is log(HOVAL), so the in-sample predictions are on the log scale
columbus$predicted <- as.numeric(predict(sem_model))

# Calculate the residuals on the same (log) scale as the predictions
columbus$residuals <- log(columbus$HOVAL) - columbus$predicted

# Calculate RMSE (log scale, comparable to RMSE_log_ols_model)
RMSE_SEM <- sqrt(mean(columbus$residuals^2))

summary(sem_model)
```

##########################
# XGBoost Regression
##########################

#### **NOTE**: XGBoost is sensitive to in-formula transformations such as log() and I()^2 in the regression equation, so we will transform the data first and include the transformed variables directly in the dataset. 

```{r}
columbus_data_alt <- columbus_data %>% select(HOVAL, INC, CRIME, OPEN, PLUMB, DISCBD, EW)

columbus_data_alt$INC      <- log(columbus_data_alt$INC)
columbus_data_alt$CRIME    <- log(columbus_data_alt$CRIME)
columbus_data_alt$OPEN     <- ((columbus_data_alt$OPEN) + 0.01)
columbus_data_alt$OPEN     <- log(columbus_data_alt$OPEN)
columbus_data_alt$PLUMB    <- log(columbus_data_alt$PLUMB)
columbus_data_alt$DISCBD   <- log(columbus_data_alt$DISCBD)

summary(columbus_data_alt)

set.seed(123) # set.seed() fixes the random number generator so the random partition is the same each time the script runs.
cv_data   <- createDataPartition(y = columbus_data_alt$INC, p=0.7, list=F)
cv_train = columbus_data_alt[cv_data, ]
cv_test = columbus_data_alt[-cv_data, ]
```

#### define explanatory variables (X's) and dependent variable (Y) in training set
```{r}
train_x = data.matrix(cv_train[, -1])
train_y = cv_train[,1]
```

#### define explanatory variables (X's) and dependent variable (Y) in testing set
```{r}
test_x = data.matrix(cv_test[, -1])
test_y = cv_test[, 1]
```

#### define final training and testing sets
```{r}
xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test  = xgb.DMatrix(data = test_x, label = test_y)
```

#### Let's fit the XGBoost regression model and display the RMSE for both training and testing data at each round
```{r}
watchlist = list(train=xgb_train, test=xgb_test)
model_xgb = xgb.train(data=xgb_train, max.depth=3, watchlist=watchlist, nrounds=70) # the more rounds selected, the longer it takes to display the results. 

# It looks like the lowest RMSE for both the training and test datasets is achieved at round 59. 
# Let's estimate our final regression model
reg_xgb = xgboost(data = xgb_train, max.depth = 3, nrounds = 59, verbose = 0) # setting verbose = 0 suppresses the per-round training and testing error output. 
prediction_xgb_test<-predict(reg_xgb, xgb_test)
rmse(prediction_xgb_test, cv_test$HOVAL)

# Let's run some diagnostic checks on the regression residuals 
xgb_reg_residuals<-cv_test$HOVAL - prediction_xgb_test
# plot() on a single vector draws the values against their index
plot(xgb_reg_residuals, xlab= "Observation Index", ylab = "Residuals", main = 'XGBoost Regression Residuals')
abline(0,0)

# Plot first 3 trees of model
xgb.plot.tree(model=reg_xgb, trees=0:2)
importance_matrix <- xgb.importance(model = reg_xgb)
xgb.plot.importance(importance_matrix, xlab = "Explanatory Variables X's Importance")
```
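
Rather than reading the best round off the printed output, it can be selected programmatically. The sketch below uses a hypothetical evaluation log with made-up numbers. Objects returned by `xgb.train()` typically carry a similar per-round table in `evaluation_log`, but that name and its column names are worth verifying against the installed xgboost version.

```r
# Hypothetical per-round evaluation log (numbers are made up for illustration)
eval_log <- data.frame(
  iter       = 1:6,
  train_rmse = c(14.0, 10.2, 7.5, 5.1, 3.4, 2.2),
  test_rmse  = c(15.1, 12.3, 11.0, 10.6, 10.9, 11.4)
)

# Training RMSE keeps falling, so the stopping point is chosen on the test RMSE
best_round <- eval_log$iter[which.min(eval_log$test_rmse)]
best_round

# Rounds after best_round lower the training error but raise the test error,
# which is the overfitting pattern to watch for
eval_log[eval_log$iter >= best_round, ]
```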

##########################
# Suggested Readings
##########################

#### 1) Understanding Linear Regression Output in R
#### https://towardsdatascience.com/understanding-linear-regression-output-in-r-7a9cbda948b3 

#### 2) How to run and interpret simple regression models in R
#### https://medium.com/data-and-beyond/how-to-run-and-interpret-simple-regression-models-in-r-718623c524c1 

#### 3) What is Supervised Learning? 
#### https://www.ibm.com/topics/supervised-learning

#### 4) Supervised vs. Unsupervised Learning
#### https://www.ibm.com/blog/supervised-vs-unsupervised-learning/ 

#### 5) Maps of Meaning: Why You Need to Study Spatial Statistics as a Data Scientist 
#### https://baotramduong.medium.com/maps-of-meaning-why-you-need-to-study-spatial-statistics-as-a-data-scientist-802ce6ce2878



# Exploratory Data Analysis

1. Data Visualization: Creating maps with tm_shape and ggplot to visualize house prices (HOVAL) in Columbus, Ohio, makes it possible to identify spatial patterns and areas of interest. The warnings about the unknown projection suggest that the projection of the data should be specified for accurate spatial analysis.

2. Preliminary Analysis: The initial exploration shows substantial variability in housing prices, which is crucial for understanding the dynamics of the real estate market across neighborhoods.



# Diagnostic Tests

1. Coordinate Errors: The error raised when extracting coordinates indicates that the coordinates function is not compatible with sf objects. Consider using st_coordinates for sf objects.

2. Model Tests: The AIC values for ols_model and log_ols_model suggest that the logarithmic model provides a better fit, pointing to the value of log transformations for some variables. Note that the two AIC values are not directly comparable as reported, because one model's response is HOVAL and the other's is log(HOVAL).



# Main Insights

1. Influence of Variables: The spatial regression shows that income and crime have significant effects on housing prices. The weight of these variables highlights the role of socioeconomic conditions and safety in property valuation.

2. Spatial Modeling: Spatial models such as SAR and SEM account for spatial autocorrelation in the data. Although Rho in the SAR model and Lambda in the SEM model are not statistically significant (p-value > 0.05), they point to at most a mild spatial dependence that should be considered in more detailed analyses.

3. Prediction and Residuals: Predicting housing prices with XGBoost and evaluating the residuals show the model's ability to capture the variability in the data, although the increase in RMSE with more training rounds suggests overfitting.
