California Housing Project
Welcome to our analytical project for the course “Causal Machine
Learning for Spatial Data.” This project represents an exploration of
the intricate relationship between economic features and spatial
dimensions using machine learning techniques. We have chosen the
California Housing dataset from the 1990 census, as featured in Aurélien
Géron’s book ‘Hands-On Machine Learning with Scikit-Learn and
TensorFlow’.
Dataset Overview
The dataset provides detailed information about housing units in
various California districts, based on the 1990 census. It includes
geographical coordinates (longitude and latitude), housing attributes
(such as median age, total rooms, and bedrooms), demographic information
(population and households), economic indicators (median income and
house value), and the location type categorized by ocean proximity.
Project Aim and Scope
We aim to conduct a meaningful analysis that sheds light on the
spatial aspects of the data, asking and answering intriguing questions
about the nature of interactions within it. By applying advanced machine
learning methods, we seek to provide insightful results that demonstrate
our analytical capabilities in merging machine learning techniques with
spatial analysis.
This project serves as an opportunity to showcase our skills and
understanding of spatial phenomena through rigorous analysis. We’re
excited to embark on this journey, unraveling the layers of data to
reveal the stories they hold.
Data preprocessing
We read spatial data from two shapefiles (‘CA_Places_TIGER2016.shp’
and ‘CA_State_TIGER2016.shp’). The data were first converted into a
spatial (sp) class object and then transformed into a simple features
(sf) class object for spatial analysis.
Data visualization
First, we selected five places in CA (listed as cities in the output
below, not counties) and plotted them on a map.
## Simple feature collection with 5 features and 16 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -13696210 ymin: 4042974 xmax: -13112750 ymax: 4695855
## Projected CRS: WGS 84 / Pseudo-Mercator
##   STATEFP PLACEFP  PLACENS   GEOID                       NAME
## 1      06   66140 02411785 0666140               San Fernando
## 2      06   14190 02409487 0614190                 Cloverdale
## 3      06   16560 02410240 0616560                     Cotati
## 4      06   65042 02411779 0665042 San Buenaventura (Ventura)
## 5      06   30014 02410601 0630014                   Glendora
##                          NAMELSAD LSAD CLASSFP PCICBSA PCINECTA MTFCC FUNCSTAT
## 1               San Fernando city   25      C1       N        N G4110        A
## 2                 Cloverdale city   25      C1       N        N G4110        A
## 3                     Cotati city   25      C1       N        N G4110        A
## 4 San Buenaventura (Ventura) city   25      C1       Y        N G4110        A
## 5                   Glendora city   25      C1       N        N G4110        A
##      ALAND   AWATER    INTPTLAT     INTPTLON                       geometry
## 1  6148697        0 +34.2886519 -118.4362413 MULTIPOLYGON (((-13186464 4...
## 2  7863863    59201 +38.7959624 -123.0153700 MULTIPOLYGON (((-13696208 4...
## 3  4869007     8380 +38.3284920 -122.7100491 MULTIPOLYGON (((-13662198 4...
## 4 56500370 27033715 +34.2677796 -119.2542062 MULTIPOLYGON (((-13280094 4...
## 5 50251851   403066 +34.1449667 -117.8476672 MULTIPOLYGON (((-13123578 4...

Next, we studied the distribution of median house values across
California.
We then mapped median house values across the different regions of
California, using color coding to represent ranges of values and
produce an easy-to-interpret map.
The dominant yellow hue across the map indicates that the majority
of median house prices in California hover around the $100,000 mark.
This widespread affordability reflects a general pricing trend across
most regions. Notably, areas north of San Jose and the region stretching
between California City and San Diego showcase median house prices in
the range of $200,000 to $300,000. These regions are distinct for their
relatively higher property values. The blue-colored areas, indicative of
houses valued above $400,000, are scant and primarily located between
Los Angeles and San Diego. This specific clustering suggests a
concentration of luxury or high-end real estate in these locales.
The correlation matrix provides a comprehensive understanding of the
relationships between different variables in the housing dataset.
Notably, there is a strong positive correlation observed between
population and key factors such as total rooms, bedrooms, and the number
of households. This correlation is intuitive, as a higher population in
an area typically necessitates a greater number of rooms, bedrooms, and
households to accommodate the residents.
Furthermore, the matrix reveals a significant positive correlation
between median house value and median income. This relationship aligns
well with economic expectations, as higher income levels often enable
individuals to afford more expensive homes. This correlation underscores
the impact of economic status on housing affordability and choices.
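As a rough illustration of what each cell of the correlation matrix measures, the following Python sketch computes Pearson correlations for a few columns. It is not the R code used for this report, and the values in `toy` are made up for the example rather than taken from the census data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return {(name_a, name_b): r} for every pair of named columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Made-up toy values, NOT taken from the census data.
toy = {
    "population": [300, 800, 1200, 2000, 3500],
    "households": [110, 290, 400, 700, 1200],
    "median_income": [2.1, 5.5, 3.0, 4.2, 3.8],
}
corr = correlation_matrix(toy)
print(round(corr[("population", "households")], 3))
```

In this toy table, population and households move almost in lockstep, which mirrors the strong positive correlation described above.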
The scatter plot comparing median income with median house value
elucidates a key trend in the housing market. There is a discernible
positive correlation between these two variables, as illustrated by the
general upward trajectory of the data points. This trend indicates that
regions with higher median incomes tend to have higher median house
values, aligning with the expectation that higher income levels enable
the purchase of more expensive properties.
However, the plot also highlights notable exceptions to this trend,
particularly at the top of the housing market. For houses valued around
$500,000, median income ranges from approximately $50,000 (5 on the
plot) to $150,000 (15 on the plot). Part of this spread is likely an
artifact of the data: median house values in this dataset are capped at
roughly $500,000, so districts with quite different true values pile up
at the ceiling. Beyond that, the wide range suggests that in certain
regions even households with relatively moderate incomes can afford
high-value properties, possibly due to factors like inherited wealth,
lower living costs, or availability of more affordable housing options
in high-value areas.
The presence of these outliers indicates that while income is a
significant factor in determining house value, it is not the sole
determinant. Other socio-economic factors and regional characteristics
might also play a crucial role in influencing the housing market
dynamics. This plot, therefore, provides a nuanced understanding of the
relationship between income and housing value, underlining the
complexity of the real estate market.

The ggplot visualization that contrasts housing median age with
median house value, further categorized by ocean proximity, offers an
insightful perspective into the housing market dynamics. A key
observation from this plot is the apparent lack of a strong, consistent
relationship between the age of housing and its value. This suggests
that factors other than age may play a more critical role in determining
house prices, such as location, size, or market demand.
Particularly interesting is the case of properties near the bay.
While these properties tend to have a median age above 10 years, there
is no distinct trend indicating that age significantly influences their
value. This lack of a clear pattern across different ocean proximities
implies that the influence of age on housing value is complex and likely
varies depending on other contextual factors.
While one might typically expect older properties to be less
valuable, this plot reveals that in the context of California housing,
age does not seem to be a decisive factor in determining house values,
especially when considering the influence of ocean proximity.

Geographically Weighted Regression (GWR) Model Implementation
We are now implementing a Geographically Weighted Regression (GWR)
model on a selected subset of the data, comprising 1,000 observations.
The GWR model is a local form of linear regression designed to analyze
spatially varying relationships, thereby providing insights into how
these relationships differ across geographical locations.
## Adaptive bandwidth: 625 CV score: 4.840069e+12
## Adaptive bandwidth: 394 CV score: 4.755858e+12
## Adaptive bandwidth: 250 CV score: 4.728794e+12
## Adaptive bandwidth: 162 CV score: 4.704896e+12
## Adaptive bandwidth: 107 CV score: Inf
## Adaptive bandwidth: 195 CV score: 4.71469e+12
## Adaptive bandwidth: 140 CV score: 4.689396e+12
## Adaptive bandwidth: 128 CV score: 4.674825e+12
## Adaptive bandwidth: 119 CV score: 4.664938e+12
## Adaptive bandwidth: 115 CV score: NaN
## Adaptive bandwidth: 123 CV score: 4.670856e+12
## Adaptive bandwidth: 118 CV score: NaN
## Adaptive bandwidth: 121 CV score: 4.668197e+12
## Adaptive bandwidth: 119 CV score: 4.664938e+12
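
To make the idea of a local regression concrete, the sketch below fits a weighted least squares line around a single location, with the kernel width set by the distance to the k-th nearest neighbour. This is what an adaptive bandwidth means; the search above settles on 119, read here as a neighbour count out of the 1,000 observations. It is a simplified Python illustration, not the R routine that produced the output above. The single predictor, the Gaussian kernel, and the synthetic points are all assumptions made for the example.

```python
import math

def local_fit(points, target_index, k):
    """Weighted least squares of y on x around one location.

    points: list of (lon, lat, x, y) tuples.
    k: adaptive bandwidth, i.e. the kernel width at this location is the
       distance to its k-th nearest neighbour.
    Returns (intercept, slope) of the local model y = a + b * x.
    """
    lon0, lat0 = points[target_index][0], points[target_index][1]
    dists = [math.hypot(p[0] - lon0, p[1] - lat0) for p in points]
    # Index 0 of the sorted distances is the target itself (distance 0).
    h = sorted(dists)[k]
    # Gaussian kernel (an assumption; other kernels are common too).
    weights = [math.exp(-0.5 * (d / h) ** 2) for d in dists]

    sw = sum(weights)
    mx = sum(w * p[2] for w, p in zip(weights, points)) / sw
    my = sum(w * p[3] for w, p in zip(weights, points)) / sw
    sxy = sum(w * (p[2] - mx) * (p[3] - my) for w, p in zip(weights, points))
    sxx = sum(w * (p[2] - mx) ** 2 for w, p in zip(weights, points))
    slope = sxy / sxx
    return my - slope * mx, slope

# Synthetic 4 x 5 grid: the true slope of y on x is (1 + lat), so it
# increases from south (lat 0) to north (lat 4). Not real housing data.
points = [(float(j), float(lat), float(j + 1), (1.0 + lat) * (j + 1))
          for lat in range(5) for j in range(4)]
south_slope = local_fit(points, 0, k=6)[1]    # location (lon 0, lat 0)
north_slope = local_fit(points, 16, k=6)[1]   # location (lon 0, lat 4)
print(south_slope, north_slope)
```

Because the toy slope grows with latitude, the local fit in the south returns a smaller slope than the one in the north, which is the kind of spatial variation GWR is designed to detect.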

The series of plots created through the Geographically Weighted
Regression (GWR) model offer a deeper understanding of how various
factors influence house values across different geographical locations.
Each plot represents the coefficients for variables such as median
income, housing median age, total rooms, and total bedrooms, segmented
by quartiles, thus providing a spatial dimension to the analysis.
The similarity in coefficients for each variable across different
geographical locations indicates that the impact of these factors on
house values is relatively consistent, regardless of the specific area.
This uniformity could imply that the housing market dynamics for these
variables are stable across the study region.
- Median Income: The coefficient for median income is consistently
positive across locations, reaffirming the general trend that higher
incomes are associated with higher house values, a relationship that
holds true in most of the regions analyzed.
- Housing Median Age: A similar coefficient for housing median age
across different areas suggests that the age of housing affects house
values in a uniform manner across the region. Whether this impact is
positive or negative depends on the coefficient values, which are
examined below.
- Total Bedrooms: Likewise, a consistent coefficient for total
bedrooms across locations indicates a uniform positive effect of this
variable on house values.
- Total Rooms: Surprisingly, the total number of rooms seems to be
associated with lower house values in these areas. This could indicate
that in certain locations, larger houses (in terms of room count) are
not as valued, possibly due to higher maintenance costs, or it might
reflect a trend towards smaller, more efficient living spaces. However,
given the high correlation between total rooms and total bedrooms, the
two coefficients may be offsetting one another, so the negative sign on
total rooms should be read with caution rather than as a standalone
effect of home size on market value.
Moving forward, our next step involves organizing the GWR
coefficients into clusters for a detailed analysis. Prior to this, we
will analyse the distribution of these coefficients to thoroughly
understand their range and behavior.
##       housing_median_age total_rooms total_bedrooms population households
## 12752           1380.287   -11.83009       90.18730  -39.97392   96.60386
## 2467            1380.500   -11.83029       90.17436  -39.98225   96.64155
## 9660            1380.073   -11.82988       90.19676  -39.96663   96.57259
## 2436            1380.508   -11.83030       90.17347  -39.98270   96.64376
## 13570           1380.779   -11.83057       90.15632  -39.99350   96.69297
## 2480            1380.514   -11.83031       90.17421  -39.98258   96.64272
##       median_income ocean_proximityINLAND ocean_proximityNEAR.BAY
## 12752      42864.79             -65483.03               -7658.802
## 2467       42864.33             -65480.57               -7662.417
## 9660       42864.99             -65486.03               -7655.719
## 2436       42864.28             -65480.53               -7662.621
## 13570      42863.65             -65477.51               -7667.322
## 2480       42864.35             -65480.30               -7662.546
##       ocean_proximityNEAR.OCEAN
## 12752                  10895.40
## 2467                   10884.16
## 9660                   10903.89
## 2436                   10883.42
## 13570                  10868.58
## 2480                   10883.98

- Housing Median Age (1380 to 1380.8): The small range in
coefficients for housing median age suggests a nearly uniform and
slightly positive influence on housing values across the studied
locations. It indicates that older houses may have marginally higher
values, but the effect is subtle.
- Total Rooms (-11.8305 to -11.8300): The consistently negative
coefficients for total rooms imply a slight decrease in housing values
with an increase in the number of rooms. This could suggest that in the
studied areas, larger homes (by room count) are not necessarily more
valuable, possibly due to higher maintenance or other factors.
- Total Bedrooms (90.150 to 90.200): The positive range for total
bedrooms suggests that an increase in bedrooms generally leads to a
slight increase in housing values. This reflects the demand for more
bedrooms in houses across the studied areas. Again, given the high
correlation between total rooms and total bedrooms, it appears that
these variables may be offsetting one another’s influence on house
values across different locations.
- Population (-40 to -39.96): The negative coefficient range
indicates that higher population density might slightly decrease housing
values, suggesting that less densely populated areas are preferred,
possibly due to factors like noise, congestion, or privacy.
- Households (96.55 to 96.75): The positive coefficients for
households indicate a slight increase in housing values with more
households, which could be indicative of the desirability of areas with
more residential development.
- Median Income (42863 to 42866): The large and positive
coefficients for median income strongly suggest that higher median
incomes are associated with significantly higher housing values,
reflecting the economic capability of residents to afford more expensive
homes.
- Ocean Proximity (Inland) (-65485 to -65475): The substantial
negative coefficients for inland locations indicate a strong decrease in
housing values compared to coastal areas, highlighting the premium
placed on ocean proximity.
- Ocean Proximity (Near Bay) (-7670 to -7650): The negative
coefficients, though less extreme than for inland areas, suggest that
being near the bay is slightly less desirable than other coastal
locations in terms of housing values.
- Ocean Proximity (Near Ocean) (10860 to 10915): This positive range,
the widest of all the coefficients in absolute terms, indicates that
proximity to the ocean generally increases housing values, with some
variability in the extent of this effect across different areas.
The consistency in the direction of these effects across different
areas (as indicated by the narrow ranges) suggests that these factors
have a similar influence on housing values throughout the regions
analyzed in our GWR model.
K-means clustering
We grouped the GWR coefficients with k-means clustering. Spatially, the
resulting clusters split into a northern and a southern group. This
clear division could be indicative of significant regional differences
within the state. Such clustering might align with known socioeconomic
or geographical patterns, suggesting that these underlying factors play
a crucial role in the clustering outcome.
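For readers unfamiliar with the procedure, here is a minimal Python sketch of k-means (Lloyd's algorithm) applied to rows of coefficients. It is not the R code behind the clusters described above. The toy rows are invented, and standardising the columns before clustering is an assumption of this sketch: the report does not say whether the coefficients were scaled.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def standardise(rows):
    """Z-score every column so that no coefficient dominates by scale."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(r, means, sds)) for r in rows]

def kmeans(rows, k, iterations=50):
    """Plain Lloyd's algorithm; returns (labels, centres)."""
    # Deterministic start: the first row, then the row farthest from the
    # centres chosen so far.
    centres = [rows[0]]
    while len(centres) < k:
        centres.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centres)))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centres[j])) for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centres[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centres

# Invented coefficient rows (two columns), NOT the GWR output above.
toy_coefs = [(0.10, 1.0), (0.20, 1.1), (0.15, 0.9),
             (0.90, 3.0), (1.00, 3.2), (0.95, 2.9)]
labels, centres = kmeans(standardise(toy_coefs), k=2)
print(labels)
```

Mapping each observation's label back onto its coordinates is what produces a cluster map like the one discussed above.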
Spatial drift model
In the next phase of our analysis, we embark on the implementation
of a Spatial Drift model, a crucial step to unravel the spatial
dependencies within our housing dataset. This approach involves an
encoding process where clusters identified from previous steps are
transformed into dummy variables. This transformation is executed for
both spatial (sf object) and dataframe (df object) formats, ensuring
consistency and integration of our cluster analysis across different
data structures.
We then construct a new regression equation, incorporating these
dummy variables along with other key predictors, to thoroughly examine
the influence of each factor on median house values. To further refine
our model, we establish a spatial weights matrix, recognizing the
spatial proximity and neighborhood relationships inherent in the
data.
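The map shows the spatial connections between features in the
dataset. The lines appear to represent the neighbor links of the
symmetric k-nearest-neighbor weights (the housing.knn.sym.listw object
used in the model below) rather than the k-means clusters.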
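A symmetric k-nearest-neighbour weights structure of this kind can be sketched as follows. This is an illustrative Python version, not the R code behind the map. The coordinates are invented, and the row-standardisation step is an assumption, since the report does not state which weighting style was used.

```python
import math

def knn_neighbours(coords, k):
    """For each point, the set of indices of its k nearest other points."""
    result = []
    for i, (xi, yi) in enumerate(coords):
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.hypot(coords[j][0] - xi, coords[j][1] - yi),
        )
        result.append(set(others[:k]))
    return result

def symmetrise(neighbours):
    """Make links mutual: if j is a neighbour of i, i becomes one of j."""
    sym = [set(n) for n in neighbours]
    for i, n in enumerate(neighbours):
        for j in n:
            sym[j].add(i)
    return sym

def row_standardise(neighbours):
    """Give each point's neighbours equal weights that sum to 1."""
    return [{j: 1.0 / len(n) for j in n} for n in neighbours]

# Invented coordinates on a line, NOT the housing locations.
coords = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (10.0, 0.0)]
sym = symmetrise(knn_neighbours(coords, k=1))
weights = row_standardise(sym)
print(sym)
```

Plain k-nearest-neighbour links are not necessarily mutual (the isolated point at x = 10 picks the point at 2.5 as its neighbour, but not the other way round), which is why a symmetrising step is applied before building the weights.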

The next step is to estimate the new equation, which includes the dummy
variables derived from clustering, as a spatial error model.
##
## Call:
## errorsarlm(formula = eq, data = housing.df.sel, listw = housing.knn.sym.listw)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -242410  -35332  -10786   22369  332276
##
## Type: error
## Coefficients: (asymptotic standard errors)
##     (1 not defined because of singularities)
##                              Estimate Std. Error z value  Pr(>|z|)
## (Intercept)                8.2570e+04 1.2973e+04  6.3649 1.954e-10
## housing_median_age         6.0521e+02 1.9988e+02  3.0278  0.002463
## total_rooms                4.7938e-01 3.4021e+00  0.1409  0.887943
## total_bedrooms             3.7058e+01 2.5099e+01  1.4765  0.139811
## population                -3.0742e+01 4.9027e+00 -6.2704 3.602e-10
## households                 4.9896e+01 2.6715e+01  1.8677  0.061806
## median_income              3.2884e+04 1.5058e+03 21.8385 < 2.2e-16
## ocean_proximityINLAND     -7.0277e+04 8.0564e+03 -8.7232 < 2.2e-16
## ocean_proximityNEAR BAY   -3.3971e+03 1.1760e+04 -0.2889  0.772680
## ocean_proximityNEAR OCEAN  1.1296e+04 9.4998e+03  1.1891  0.234393
## V1_1                       2.8729e+03 8.0204e+03  0.3582  0.720198
## V1_2                               NA         NA      NA        NA
##
## Lambda: 0.44469, LR test value: 153.84, p-value: < 2.22e-16
## Approximate (numerical Hessian) standard error: 0.030061
## z-value: 14.793, p-value: < 2.22e-16
## Wald statistic: 218.82, p-value: < 2.22e-16
##
## Log likelihood: -12477.2 for error model
## ML residual variance (sigma squared): 3696700000, (sigma: 60801)
## Number of observations: 1000
## Number of parameters estimated: 13
## AIC: NA (not available for weighted model), (AIC for lm: 25132)
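The note “1 not defined because of singularities” indicates perfect
multicollinearity in the model, meaning one of the variables can be
perfectly predicted by the others. In this case, it is one of the dummy
variables (V1_2). With an intercept and a dummy for every cluster (here
V1_1 and V1_2), the dummies sum to one in every row, so V1_2 equals 1
minus V1_1 and its coefficient is reported as NA. The remaining cluster
dummy, V1_1, is not statistically significant (p-value of about 0.72).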
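The singularity can be reproduced on a toy example. The Python sketch below one-hot encodes a made-up vector of cluster labels and compares the rank of the design matrix with and without the redundant dummy. It is an illustration of the general mechanism, not a re-run of the model above, and the helper names are hypothetical.

```python
def one_hot(labels, levels):
    """One dummy column per level: 1.0 where the label matches, else 0.0."""
    return [[1.0 if lab == lev else 0.0 for lev in levels] for lab in labels]

def rank(matrix, tol=1e-9):
    """Matrix rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    r = 0
    for c in range(n_cols):
        pivot = max(range(r, n_rows), key=lambda i: abs(m[i][c]))
        if abs(m[pivot][c]) < tol:
            continue  # no usable pivot: this column depends on earlier ones
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, n_rows):
            factor = m[i][c] / m[r][c]
            for j in range(c, n_cols):
                m[i][j] -= factor * m[r][j]
        r += 1
        if r == n_rows:
            break
    return r

# Made-up cluster labels for five observations.
clusters = [1, 2, 2, 1, 2]
dummies = one_hot(clusters, levels=[1, 2])          # columns V1_1, V1_2

full_design = [[1.0] + d for d in dummies]          # intercept + both dummies
reduced_design = [[1.0, d[0]] for d in dummies]     # intercept + V1_1 only

print(rank(full_design), "of", len(full_design[0]), "columns independent")
print(rank(reduced_design), "of", len(reduced_design[0]), "columns independent")
```

Dropping one dummy (or the intercept) restores full column rank; the retained dummy's coefficient is then interpreted relative to the omitted cluster.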
The model indicates significant spatial autocorrelation in the error
term, meaning that the part of house values left unexplained by the
predictors is correlated among neighboring observations. The spatial
error parameter (Lambda) is estimated at about 0.44, and the LR test,
Wald statistic, and z-value all support its significance. The log
likelihood and residual variance are mainly useful for comparison with
alternative models rather than as standalone measures of fit.
The significant variables are:
- Housing Median Age: This variable is significant (p-value =
0.002463), indicating that the age of housing plays a notable role in
determining house values. The positive coefficient suggests that older
houses tend to have higher values, possibly due to factors like
location, historical value, or construction quality.
- Population: The significant negative coefficient for population
(p-value ~ 0) implies that higher population densities are associated
with lower house values. This could reflect preferences for less crowded
areas or the characteristics of densely populated regions.
- Median Income: With a very low p-value (~ 0), median income is
highly significant and positively correlated with the dependent
variable. This indicates a strong relationship where areas with higher
median incomes tend to have higher house values.
- Ocean Proximity (Inland): This variable is significantly negative
(p-value ~ 0), showing that properties located inland are valued lower
compared to those closer to the coast. This is a clear indication of the
premium placed on coastal proximity in the housing market.
Random Forest
The ensemble comprises 500 individual decision trees, each
considering 2 randomly selected variables at every split. The mean
squared error between the predicted and observed values is 4610761217,
which corresponds to a root mean squared error of roughly $67,900. The
model explains approximately 65.63% of the variance in
the response variable. This indicates a moderate to substantial level of
explanatory power, showcasing the model’s ability to capture underlying
patterns in the data. However, there’s still a portion of the variance
unexplained by the model, which could be due to factors not included in
the model or inherently unpredictable aspects of the data.
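
The two reported figures can be related with a little arithmetic. The Python snippet below derives the root mean squared error and, assuming the usual definition of variance explained as 1 - MSE / Var(y), backs out the spread of the response that those figures imply. The definition is an assumption about how the percentage was computed, not something stated in the output.

```python
import math

mse = 4610761217          # mean squared error reported for the forest
var_explained = 0.6563    # share of variance explained reported for the forest

# RMSE puts the error back on the dollar scale of median house value.
rmse = math.sqrt(mse)

# Assumed definition: var_explained = 1 - mse / var_y, hence:
implied_var_y = mse / (1 - var_explained)
implied_sd_y = math.sqrt(implied_var_y)

print(f"RMSE: about ${rmse:,.0f}")
print(f"Implied standard deviation of house value: about ${implied_sd_y:,.0f}")
```

Read this way, a typical prediction error is a little under 60% of the standard deviation of median house value, which is another way of seeing that roughly a third of the variance remains unexplained.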

The variable importance plot for the random forest model illustrates
the significance of different predictors in influencing the model’s
predictions. This plot provides an overview of the relative importance
of each variable, helping us identify key factors driving the model’s
performance. It reveals that among the predictors, median_income
shows the largest increase in Mean Squared Error (MSE) when its values
are permuted, indicating its substantial impact on the model’s
predictive accuracy. Conversely, the households variable shows the
smallest increase, suggesting a relatively lower influence on the
model’s overall performance.
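The idea behind this kind of importance score can be sketched as follows: shuffle one predictor, re-score the model, and record how much the MSE rises. The Python example below is a simplified illustration, not the R computation behind the plot. `toy_model` is a hypothetical stand-in for a fitted forest, the data are randomly generated, and real implementations typically score on out-of-bag samples and normalise the result.

```python
import random

def mse(model, rows, targets):
    """Mean squared error of model predictions against targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, feature, repeats=20, seed=0):
    """Average increase in MSE after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    increases = []
    for _ in range(repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
        increases.append(mse(model, permuted, targets) - baseline)
    return sum(increases) / repeats

def toy_model(row):
    """Hypothetical stand-in for a fitted model; coefficients are invented."""
    return 40000 * row["median_income"] + 10 * row["households"]

# Randomly generated rows, NOT the housing data.
data_rng = random.Random(1)
rows = [{"median_income": data_rng.uniform(1, 10),
         "households": data_rng.uniform(100, 1000)} for _ in range(50)]
targets = [toy_model(r) for r in rows]

imp_income = permutation_importance(toy_model, rows, targets, "median_income")
imp_households = permutation_importance(toy_model, rows, targets, "households")
print(imp_income > imp_households)
```

A predictor whose shuffling barely changes the error contributes little to the predictions, which is how the ranking in the plot should be read.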
This partial dependence plot of ocean proximity shows how the
different levels of ocean_proximity influence the model’s predictions.
Notably, the category ‘Inland’ tends to have a lower predicted value,
while ‘Near Bay’ and ‘Near Ocean’ exhibit higher ones.
The partial dependence plot for population shows a declining pattern
from 0 to 2000, reaching its lowest point around 2000. The trend then
increases slightly and stabilizes.
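A partial dependence curve is obtained by forcing one predictor to a fixed value for every observation, averaging the model's predictions, and repeating this over a grid of values. The Python sketch below shows the mechanics. `toy_model` is a hypothetical function whose shape was chosen to decline and then flatten; it is not fitted to the housing data and does not reproduce the plot described above.

```python
def partial_dependence(model, rows, feature, grid):
    """Average prediction when `feature` is set to each grid value in turn."""
    curve = []
    for value in grid:
        preds = [model(dict(r, **{feature: value})) for r in rows]
        curve.append((value, sum(preds) / len(preds)))
    return curve

def toy_model(row):
    """Hypothetical stand-in for a fitted model; numbers are invented."""
    return (200000
            + 30000 * row["median_income"]
            - 20 * min(row["population"], 2000))

# Invented observations, NOT the housing data.
rows = [
    {"median_income": 2.0, "population": 500},
    {"median_income": 4.0, "population": 1500},
    {"median_income": 6.0, "population": 3000},
]
curve = partial_dependence(toy_model, rows, "population", [0, 1000, 2000, 4000])
for value, avg_pred in curve:
    print(value, avg_pred)
```

The other predictors keep their observed values throughout, so the curve isolates how the average prediction responds to the chosen variable alone.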
Summary
In this project on Machine Learning with Spatial Data, we analysed the
California Housing dataset from the 1990 census, focusing on the complex
relationship between housing characteristics, economic indicators, and
spatial aspects. Utilizing a suite of R libraries, we preprocessed the
data, ensuring clean and formatted input for analysis. Our approach
included applying Geographically Weighted Regression (GWR) to understand
spatially varying relationships and k-means clustering to discern
patterns in geographical distribution.
A key aspect of our project was the spatial analysis, where we
visualized and interpreted the distribution of median house values
across California. Our findings highlighted a strong correlation between
median income and house values, and a significant influence of ocean
proximity on property prices. The use of Random Forest models provided
deeper insights into variable importance and predictive capabilities,
enhancing our understanding of how different factors impact housing
values.