California Housing Project

Dataset Overview

The dataset provides detailed information about housing units in various California districts, based on the 1990 census. It includes geographical coordinates (longitude and latitude), housing attributes (such as median age, total rooms, and bedrooms), demographic information (population and households), economic indicators (median income and house value), and the location type categorized by ocean proximity.

Project Aim and Scope

We aim to conduct a meaningful analysis that sheds light on the spatial aspects of the data, asking and answering intriguing questions about the nature of interactions within it. By applying advanced machine learning methods, we seek to provide insightful results that demonstrate our analytical capabilities in merging machine learning techniques with spatial analysis.
This project serves as an opportunity to showcase our skills and understanding of spatial phenomena through rigorous analysis. We’re excited to embark on this journey, unraveling the layers of data to reveal the stories they hold.

Data preprocessing

We read spatial data from two shapefiles (‘CA_Places_TIGER2016.shp’ and ‘CA_State_TIGER2016.shp’). The data were first converted into a spatial (sp) class object and then transformed into a simple features (sf) class object for spatial analysis.
Housing data is prepared by converting specific columns to numeric types, transforming the ‘ocean_proximity’ column into a factor (and further encoding it as an integer), and removing any rows with missing values, ensuring the data is clean and appropriately formatted for analysis.
## Simple feature collection with 1 feature and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -13857270 ymin: 3832931 xmax: -12705030 ymax: 5162404
## Projected CRS: WGS 84 / Pseudo-Mercator
##   REGION DIVISION STATEFP  STATENS GEOID STUSPS       NAME LSAD MTFCC FUNCSTAT
## 1      4        9      06 01779778    06     CA California   00 G4000        A
##          ALAND      AWATER    INTPTLAT     INTPTLON
## 1 403501101370 20466718403 +37.1551773 -119.5434183
##                         geometry
## 1 MULTIPOLYGON (((-13317677 3...
## Simple feature collection with 3 features and 9 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -122.24 ymin: 37.85 xmax: -122.22 ymax: 37.88
## Geodetic CRS:  +proj=longlat +ellps=WGS84 +no_defs
##   housing_median_age total_rooms total_bedrooms population households
## 1                 41         880            129        322        126
## 2                 21        7099           1106       2401       1138
## 3                 52        1467            190        496        177
##   median_income median_house_value ocean_proximity ocean_proximity_encoded
## 1        8.3252             452600        NEAR BAY                       4
## 2        8.3014             358500        NEAR BAY                       4
## 3        7.2574             352100        NEAR BAY                       4
##                geometry
## 1 POINT (-122.23 37.88)
## 2 POINT (-122.22 37.86)
## 3 POINT (-122.24 37.85)
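
The report's own R code is not shown, so the following is only a minimal Python sketch of the housing-table preparation described above: casting columns to numeric, integer-encoding `ocean_proximity`, and dropping rows with missing values. The three-row extract and the column subset are made up for illustration. The encoding assumes alphabetically sorted levels numbered from 1, which matches the printed output, where NEAR BAY is encoded as 4.

```python
import csv
import io

# Hypothetical three-row extract; the real input file is not shown in the report.
RAW = """median_income,total_bedrooms,ocean_proximity
8.3252,129,NEAR BAY
3.1200,,INLAND
5.6431,280,<1H OCEAN
"""

NUMERIC_COLUMNS = ["median_income", "total_bedrooms"]
# Assumption: levels are sorted alphabetically and numbered from 1.
LEVELS = sorted(["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"])


def clean(rows):
    """Drop rows with an empty field, cast numeric columns, integer-encode the category."""
    cleaned = []
    for row in rows:
        if any(value.strip() == "" for value in row.values()):
            continue  # counterpart of removing rows with missing values
        record = dict(row)
        for column in NUMERIC_COLUMNS:
            record[column] = float(row[column])
        record["ocean_proximity_encoded"] = LEVELS.index(row["ocean_proximity"]) + 1
        cleaned.append(record)
    return cleaned


housing = clean(csv.DictReader(io.StringIO(RAW)))
```

Here the second row is dropped because `total_bedrooms` is empty, and the remaining rows carry both the original label and its integer code.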

Data visualization

First, we selected five places (cities) from the California Places shapefile and plotted them on a map.
## Simple feature collection with 5 features and 16 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -13696210 ymin: 4042974 xmax: -13112750 ymax: 4695855
## Projected CRS: WGS 84 / Pseudo-Mercator
##   STATEFP PLACEFP  PLACENS   GEOID                       NAME
## 1      06   66140 02411785 0666140               San Fernando
## 2      06   14190 02409487 0614190                 Cloverdale
## 3      06   16560 02410240 0616560                     Cotati
## 4      06   65042 02411779 0665042 San Buenaventura (Ventura)
## 5      06   30014 02410601 0630014                   Glendora
##                          NAMELSAD LSAD CLASSFP PCICBSA PCINECTA MTFCC FUNCSTAT
## 1               San Fernando city   25      C1       N        N G4110        A
## 2                 Cloverdale city   25      C1       N        N G4110        A
## 3                     Cotati city   25      C1       N        N G4110        A
## 4 San Buenaventura (Ventura) city   25      C1       Y        N G4110        A
## 5                   Glendora city   25      C1       N        N G4110        A
##      ALAND   AWATER    INTPTLAT     INTPTLON                       geometry
## 1  6148697        0 +34.2886519 -118.4362413 MULTIPOLYGON (((-13186464 4...
## 2  7863863    59201 +38.7959624 -123.0153700 MULTIPOLYGON (((-13696208 4...
## 3  4869007     8380 +38.3284920 -122.7100491 MULTIPOLYGON (((-13662198 4...
## 4 56500370 27033715 +34.2677796 -119.2542062 MULTIPOLYGON (((-13280094 4...
## 5 50251851   403066 +34.1449667 -117.8476672 MULTIPOLYGON (((-13123578 4...

Then, we studied the distribution of median house values across California.
Median house values have a right-skewed distribution. The distribution ranges from around $20,000 to just above $500,000, with the median at around $150,000.
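
As a rough illustration of what right skew means numerically, the Python sketch below computes the median, the mean, and a moment-based skewness coefficient. The sample values are made up, not taken from the dataset. In a right-skewed sample the mean sits above the median and the coefficient is positive.

```python
import statistics


def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by the cubed standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


# Made-up house values (in dollars) with a long upper tail.
sample = [90_000, 110_000, 130_000, 150_000, 170_000, 220_000, 480_000]
print(statistics.median(sample), statistics.fmean(sample), skewness(sample))
```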

Then we look at spatial visualization of median housing prices across different regions in California. We used color coding to represent various ranges of median house values, creating an easy-to-interpret visual map.
The dominant yellow hue across the map indicates that the majority of median house prices in California hover around the $100,000 mark. This widespread affordability reflects a general pricing trend across most regions. Notably, areas north of San Jose and the region stretching between California City and San Diego showcase median house prices in the range of $200,000 to $300,000. These regions are distinct for their relatively higher property values. The blue-colored areas, indicative of houses valued above $400,000, are scant and primarily located between Los Angeles and San Diego. This specific clustering suggests a concentration of luxury or high-end real estate in these locales.
Moreover, the plot below illustrates a clear trend in the housing market across California: median house values increase notably with proximity to the coastline. This pattern suggests that locations closer to the ocean are highly valued, potentially due to desirable views, lifestyle factors, or limited availability of such properties. The color gradient, representing the median house values, becomes increasingly intense towards the coastal regions, indicating higher property values. This geographical correlation highlights the premium placed on ocean proximity in California’s real estate market.

The boxplot displaying median house values by ocean proximity in California offers insightful revelations about the regional real estate market. Most notably, the highest median house values are observed for properties located on islands, surpassing the $400,000 mark, indicating a premium value placed on such exclusive and possibly scarce properties. In comparison, houses situated near the bay, close to the ocean, or within an hour’s distance from the ocean exhibit a similar pricing bracket, with median values clustering around the $225,000 range. This similarity in pricing suggests a shared appeal and demand for properties in these coastal areas, likely driven by their desirable locations and ocean proximity. In stark contrast, the inland houses present the most affordable options, with a median value hovering at around $100,000, reflecting a significant price drop compared to coastal properties. This disparity underscores the influence of location and proximity to the ocean as pivotal factors in determining real estate values in California.

The analysis of the population-to-households ratio in California housing data presents a distinct right-skewed distribution. This skewness indicates that while most areas have lower population-to-households ratios, there is a long tail of areas with substantially higher ratios. Despite this skewness, the median of the distribution, hovering around 2-3, provides a central tendency that most areas fall under, signifying a typical household size of two to three people.

The correlation matrix provides a comprehensive understanding of the relationships between different variables in the housing dataset. Notably, there is a strong positive correlation observed between population and key factors such as total rooms, bedrooms, and the number of households. This correlation is intuitive, as a higher population in an area typically necessitates a greater number of rooms, bedrooms, and households to accommodate the residents.
Furthermore, the matrix reveals a significant positive correlation between median house value and median income. This relationship aligns well with economic expectations, as higher income levels often enable individuals to afford more expensive homes. This correlation underscores the impact of economic status on housing affordability and choices.
The scatter plot comparing median income with median house value elucidates a key trend in the housing market. There is a discernible positive correlation between these two variables, as illustrated by the general upward trajectory of the data points. This trend indicates that regions with higher median incomes tend to have higher median house values, aligning with the expectation that higher income levels enable the purchase of more expensive properties.
However, the plot also highlights notable exceptions to this trend, particularly in the upper echelons of the housing market. For houses valued around $500,000, there is a noticeable spread in median income levels ranging from approximately $50,000 (5 on the plot) to $150,000 (15 on the plot). This wide range suggests that in certain regions, even households with relatively moderate incomes can afford high-value properties, possibly due to factors like inherited wealth, lower living costs, or availability of more affordable housing options in high-value areas. The band at about $500,000 is also consistent with house values being capped (top-coded) at that level in the dataset, which by itself would produce a wide spread of incomes at the maximum value.
The presence of these outliers indicates that while income is a significant factor in determining house value, it is not the sole determinant. Other socio-economic factors and regional characteristics might also play a crucial role in influencing the housing market dynamics. This plot, therefore, provides a nuanced understanding of the relationship between income and housing value, underlining the complexity of the real estate market.
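
The correlation discussed above is the Pearson coefficient. As a minimal Python sketch (not the report's R code), it can be computed as the covariance of the two variables divided by the product of their standard deviations. The income and value figures below are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up values: income in tens of thousands of dollars, house value in dollars.
income = [2.1, 3.4, 4.0, 5.2, 6.8, 8.3]
value = [95_000, 140_000, 150_000, 230_000, 310_000, 450_000]
print(pearson(income, value))
```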

The ggplot visualization that contrasts housing median age with median house value, further categorized by ocean proximity, offers an insightful perspective into the housing market dynamics. A key observation from this plot is the apparent lack of a strong, consistent relationship between the age of housing and its value. This suggests that factors other than age may play a more critical role in determining house prices, such as location, size, or market demand.
Particularly interesting is the case of properties near the bay. While these properties tend to have a median age above 10 years, there is no distinct trend indicating that age significantly influences their value. This lack of a clear pattern across different ocean proximities implies that the influence of age on housing value is complex and likely varies depending on other contextual factors.
While one might typically expect older properties to be less valuable, this plot reveals that in the context of California housing, age does not seem to be a decisive factor in determining house values, especially when considering the influence of ocean proximity.

Geographically Weighted Regression (GWR) Model Implementation

We are now implementing a Geographically Weighted Regression (GWR) model on a selected subset of the data, comprising 1,000 observations. The GWR model is a local form of linear regression designed to analyze spatially varying relationships, thereby providing insights into how these relationships differ across geographical locations.

## Adaptive bandwidth: 625 CV score: 4.840069e+12 
## Adaptive bandwidth: 394 CV score: 4.755858e+12 
## Adaptive bandwidth: 250 CV score: 4.728794e+12 
## Adaptive bandwidth: 162 CV score: 4.704896e+12 
## Adaptive bandwidth: 107 CV score: Inf 
## Adaptive bandwidth: 195 CV score: 4.71469e+12 
## Adaptive bandwidth: 140 CV score: 4.689396e+12 
## Adaptive bandwidth: 128 CV score: 4.674825e+12 
## Adaptive bandwidth: 119 CV score: 4.664938e+12 
## Adaptive bandwidth: 115 CV score: NaN 
## Adaptive bandwidth: 123 CV score: 4.670856e+12 
## Adaptive bandwidth: 118 CV score: NaN 
## Adaptive bandwidth: 121 CV score: 4.668197e+12 
## Adaptive bandwidth: 119 CV score: 4.664938e+12
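
To make the bandwidth search above more concrete, here is a simplified Python sketch of the idea behind GWR with an adaptive bandwidth. At each location a weighted least-squares regression is fitted, with weights that decay with distance and reach zero at the k-th nearest neighbour. Candidate values of k are then compared through a leave-one-out cross-validation score. Several things here are assumptions rather than details from the report: the bisquare kernel, the single predictor, Euclidean distance on raw coordinates, and all function names.

```python
import math


def bisquare_weights(distances, k):
    """Adaptive bisquare kernel: the bandwidth is the distance to the k-th nearest point."""
    bandwidth = sorted(distances)[k - 1]
    return [(1 - (d / bandwidth) ** 2) ** 2 if d < bandwidth else 0.0 for d in distances]


def local_fit(coords, x, y, target, k, skip=None):
    """Weighted least squares of y on x centred on `target`; returns (intercept, slope)."""
    idx = [i for i in range(len(coords)) if i != skip]
    weights = bisquare_weights([math.dist(coords[i], target) for i in idx], k)
    total = sum(weights)
    mean_x = sum(w * x[i] for w, i in zip(weights, idx)) / total
    mean_y = sum(w * y[i] for w, i in zip(weights, idx)) / total
    sxx = sum(w * (x[i] - mean_x) ** 2 for w, i in zip(weights, idx))
    sxy = sum(w * (x[i] - mean_x) * (y[i] - mean_y) for w, i in zip(weights, idx))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cv_score(coords, x, y, k):
    """Leave-one-out CV: predict each point from a local fit that excludes it."""
    score = 0.0
    for i, target in enumerate(coords):
        intercept, slope = local_fit(coords, x, y, target, k, skip=i)
        score += (y[i] - (intercept + slope * x[i])) ** 2
    return score
```

A bandwidth search would call `cv_score` for several values of `k` and keep the one with the lowest score. When the underlying relationship is the same everywhere, the local coefficients come out nearly identical across locations.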

The series of plots created through the Geographically Weighted Regression (GWR) model offer a deeper understanding of how various factors influence house values across different geographical locations. Each plot represents the coefficients for variables such as median income, housing median age, total rooms, and total bedrooms, segmented by quartiles, thus providing a spatial dimension to the analysis.
The similarity in coefficients for each variable across different geographical locations indicates that the impact of these factors on house values is relatively consistent, regardless of the specific area. This uniformity could imply that the housing market dynamics for these variables are stable across the study region.
- Median Income: The coefficient for median income is consistently positive across locations. This reaffirms the general trend that higher incomes are associated with higher house values, a relationship that holds true in most of the regions analyzed.
- Housing Median Age: A similar coefficient for housing median age across different areas suggests that the age of housing affects house values in a uniform manner across the region. Whether this impact is positive or negative would depend on the specific value of the coefficients.
- Total Bedrooms: Likewise, a consistent coefficient for total bedrooms across locations indicates a uniform positive effect of this factor on house values.
- Total Rooms: Surprisingly, the total amount of rooms seems to be associated with lower house values in these areas. This could indicate that in certain locations, larger houses (in terms of room count) are not as valued, possibly due to higher maintenance costs, or it might reflect a trend towards smaller, more efficient living spaces. Given the high correlation between total rooms and total bedrooms, it appears that these variables may be offsetting one another’s influence on house values across different locations, leading to a more nuanced understanding of how space within a home contributes to its overall market value.
Moving forward, our next step involves organizing the GWR coefficients into clusters for a detailed analysis. Prior to this, we will analyse the distribution of these coefficients to thoroughly understand their range and behavior.
##       housing_median_age total_rooms total_bedrooms population households
## 12752           1380.287   -11.83009       90.18730  -39.97392   96.60386
## 2467            1380.500   -11.83029       90.17436  -39.98225   96.64155
## 9660            1380.073   -11.82988       90.19676  -39.96663   96.57259
## 2436            1380.508   -11.83030       90.17347  -39.98270   96.64376
## 13570           1380.779   -11.83057       90.15632  -39.99350   96.69297
## 2480            1380.514   -11.83031       90.17421  -39.98258   96.64272
##       median_income ocean_proximityINLAND ocean_proximityNEAR.BAY
## 12752      42864.79             -65483.03               -7658.802
## 2467       42864.33             -65480.57               -7662.417
## 9660       42864.99             -65486.03               -7655.719
## 2436       42864.28             -65480.53               -7662.621
## 13570      42863.65             -65477.51               -7667.322
## 2480       42864.35             -65480.30               -7662.546
##       ocean_proximityNEAR.OCEAN
## 12752                  10895.40
## 2467                   10884.16
## 9660                   10903.89
## 2436                   10883.42
## 13570                  10868.58
## 2480                   10883.98

- Housing Median Age (1380 to 1380.8): The small range in coefficients for housing median age suggests a nearly uniform and slightly positive influence on housing values across the studied locations. It indicates that older houses may have marginally higher values, but the effect is subtle.
- Total Rooms (-11.8305 to -11.8300): The consistently negative coefficients for total rooms imply a slight decrease in housing values with an increase in the number of rooms. This could suggest that in the studied areas, larger homes (by room count) are not necessarily more valuable, possibly due to higher maintenance or other factors.
- Total Bedrooms (90.150 to 90.200): The positive range for total bedrooms suggests that an increase in bedrooms generally leads to a slight increase in housing values. This reflects the demand for more bedrooms in houses across the studied areas. Again, given the high correlation between total rooms and total bedrooms, it appears that these variables may be offsetting one another’s influence on house values across different locations.
- Population (-40 to -39.96): The negative coefficient range indicates that higher population density might slightly decrease housing values, suggesting that less densely populated areas are preferred, possibly due to factors like noise, congestion, or privacy.
- Households (96.55 to 96.75): The positive coefficients for households indicate a slight increase in housing values with more households, which could be indicative of the desirability of areas with more residential development.
- Median Income (42863 to 42866): The large and positive coefficients for median income strongly suggest that higher median incomes are associated with significantly higher housing values, reflecting the economic capability of residents to afford more expensive homes.
- Ocean Proximity (Inland) (-65485 to -65475): The substantial negative coefficients for inland locations indicate a strong decrease in housing values compared to coastal areas, highlighting the premium placed on ocean proximity.
- Ocean Proximity (Near Bay) (-7670 to -7650): The negative coefficients, though less extreme than for inland areas, suggest that being near the bay is slightly less desirable than other coastal locations in terms of housing values.
- Ocean Proximity (Near Ocean) (10860 to 10915): The positive coefficients for near ocean locations indicate that proximity to the ocean generally increases housing values, with slightly more variability across areas than for the other coefficients.
The consistency in the direction of these effects across different areas (as indicated by the narrow ranges) suggests that these factors have a similar influence on housing values throughout the regions analyzed in our GWR model.

K-means clustering

The Elbow method identifies two as the optimal number of clusters, as the resulting plot shows. This effectively divides the California region into two distinct clusters.

The spatial distribution of these clusters is divided between north-south regions. This clear division into north and south clusters could be indicative of significant regional differences within the state. Such clustering might align with known socioeconomic or geographical patterns, suggesting that these underlying factors play a crucial role in the clustering outcome.
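
The Elbow method compares the within-cluster sum of squares (WCSS) for increasing numbers of clusters and looks for the point where further clusters stop paying off. The Python sketch below illustrates this with a small k-means implementation. It is not the report's R code: the farthest-first initialisation is chosen only to keep the example deterministic, and the two-blob data are made up.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_wcss(points, k, iterations=50):
    """Lloyd's algorithm with farthest-first initialisation; returns the WCSS."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroid
            for members, centroid in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Two made-up, well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
wcss = [kmeans_wcss(points, k) for k in (1, 2, 3)]
print(wcss)
```

With two clearly separated groups, the drop in WCSS from one to two clusters dwarfs the drop from two to three, which is the "elbow" that points to two clusters.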

Spatial drift model

In the next phase of our analysis, we embark on the implementation of a Spatial Drift model, a crucial step to unravel the spatial dependencies within our housing dataset. This approach involves an encoding process where clusters identified from previous steps are transformed into dummy variables. This transformation is executed for both spatial (sf object) and dataframe (df object) formats, ensuring consistency and integration of our cluster analysis across different data structures.
We then construct a new regression equation, incorporating these dummy variables along with other key predictors, to thoroughly examine the influence of each factor on median house values. To further refine our model, we establish a spatial weights matrix, recognizing the spatial proximity and neighborhood relationships inherent in the data.
By converting these relationships into a symmetric matrix and subsequently a listw object, we lay the groundwork for analyzing spatial connections and dependencies. This process is visualized through a plot, illustrating the network of connections formed based on spatial proximity, thereby providing a visual representation of the spatial structure underlying our model.
## [1] "knn"
## [1] FALSE
## [1] TRUE
The map shows the spatial connections between features in the dataset. The lines represent neighbor links determined by the k-nearest-neighbors (knn) weights, after they were made symmetric.
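
The `knn` class and the FALSE/TRUE pair printed above presumably correspond to a symmetry check before and after symmetrising the neighbour structure, although the report does not say so explicitly. The Python sketch below illustrates why k-nearest-neighbour links are generally not symmetric and how the missing reverse links can be added. Function names are illustrative and are not the spdep API.

```python
import math


def knn_neighbours(coords, k):
    """For each point, the index set of its k nearest other points."""
    result = []
    for i, p in enumerate(coords):
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.dist(p, coords[j]),
        )
        result.append(set(others[:k]))
    return result


def is_symmetric(neighbours):
    """True if every link i -> j is matched by a link j -> i."""
    return all(i in neighbours[j] for i, linked in enumerate(neighbours) for j in linked)


def make_symmetric(neighbours):
    """Add the reverse link wherever only one direction exists."""
    symmetric = [set(linked) for linked in neighbours]
    for i, linked in enumerate(neighbours):
        for j in linked:
            symmetric[j].add(i)
    return symmetric


coords = [(0, 0), (1, 0), (3, 0), (7, 0)]  # made-up locations on a line
nb = knn_neighbours(coords, 1)
print(is_symmetric(nb), is_symmetric(make_symmetric(nb)))
```

In this toy example the third point's nearest neighbour is the second point, but the second point's nearest neighbour is the first, so the raw links are one-directional until the reverse links are added.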

The next step is to fit a spatial error model using the new equation, which includes the dummy variables derived from clustering.
## 
## Call:
## errorsarlm(formula = eq, data = housing.df.sel, listw = housing.knn.sym.listw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -242410  -35332  -10786   22369  332276 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##     (1 not defined because of singularities)
##                              Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)                8.2570e+04  1.2973e+04  6.3649 1.954e-10
## housing_median_age         6.0521e+02  1.9988e+02  3.0278  0.002463
## total_rooms                4.7938e-01  3.4021e+00  0.1409  0.887943
## total_bedrooms             3.7058e+01  2.5099e+01  1.4765  0.139811
## population                -3.0742e+01  4.9027e+00 -6.2704 3.602e-10
## households                 4.9896e+01  2.6715e+01  1.8677  0.061806
## median_income              3.2884e+04  1.5058e+03 21.8385 < 2.2e-16
## ocean_proximityINLAND     -7.0277e+04  8.0564e+03 -8.7232 < 2.2e-16
## ocean_proximityNEAR BAY   -3.3971e+03  1.1760e+04 -0.2889  0.772680
## ocean_proximityNEAR OCEAN  1.1296e+04  9.4998e+03  1.1891  0.234393
## V1_1                       2.8729e+03  8.0204e+03  0.3582  0.720198
## V1_2                               NA          NA      NA        NA
## 
## Lambda: 0.44469, LR test value: 153.84, p-value: < 2.22e-16
## Approximate (numerical Hessian) standard error: 0.030061
##     z-value: 14.793, p-value: < 2.22e-16
## Wald statistic: 218.82, p-value: < 2.22e-16
## 
## Log likelihood: -12477.2 for error model
## ML residual variance (sigma squared): 3696700000, (sigma: 60801)
## Number of observations: 1000 
## Number of parameters estimated: 13 
## AIC: NA (not available for weighted model), (AIC for lm: 25132)
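The note “1 not defined because of singularities” indicates perfect multicollinearity in the model, meaning one of the variables can be perfectly predicted by the others. In this case, it is one of the cluster dummy variables (V1_2): with only two clusters, V1_1 and V1_2 always sum to one, which duplicates the intercept, so the coefficient for V1_2 cannot be estimated and is reported as NA.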
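
The singularity can be illustrated with a few lines of Python. The cluster labels below are made up, and the column names V1_1 and V1_2 are borrowed from the model output. When every observation belongs to exactly one cluster, the dummy columns sum to the intercept column, and dropping one level (reference coding) removes the redundancy.

```python
def one_hot(labels):
    """One indicator column per distinct label, in sorted label order."""
    levels = sorted(set(labels))
    return levels, [[1 if label == level else 0 for level in levels] for label in labels]


clusters = [1, 2, 2, 1, 2]           # made-up cluster labels for five observations
levels, dummies = one_hot(clusters)  # columns play the role of V1_1 and V1_2
intercept = [1] * len(clusters)

# Every row of dummies sums to 1, so V1_1 + V1_2 reproduces the intercept column exactly.
redundant = [sum(row) for row in dummies] == intercept

# Dropping one level (reference coding) removes the exact linear dependence.
reference_coded = [row[:-1] for row in dummies]
print(redundant, reference_coded)
```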
The model indicates significant spatial autocorrelation in the error term, suggesting that the unexplained variation in house values is shared among neighboring observations. The LR test, Wald statistic, and z-value all support the significance of the spatial error parameter (Lambda = 0.44). The log likelihood and residual variance are mainly useful for comparison with alternative model specifications.
The significant variables are:
- Housing Median Age: This variable is significant (p-value = 0.002463), indicating that the age of housing plays a notable role in determining house values. The positive coefficient suggests that older houses tend to have higher values, possibly due to factors like location, historical value, or construction quality.
- Population: The significant negative coefficient for population (p-value ~ 0) implies that higher population densities are associated with lower house values. This could reflect preferences for less crowded areas or the characteristics of densely populated regions.
- Median Income: With a very low p-value (~ 0), median income is highly significant and positively correlated with the dependent variable. This indicates a strong relationship where areas with higher median incomes tend to have higher house values.
- Ocean Proximity (Inland): This variable is significantly negative (p-value ~ 0), showing that properties located inland are valued lower compared to those closer to the coast. This is a clear indication of the premium placed on coastal proximity in the housing market.

Random Forest

The next part of analysis involves applying the random forest technique to assess how accurately our model can predict the median house value.
## 
## Call:
##  randomForest(formula = as.formula(paste(response, "~", paste(predictors,      collapse = "+"))), data = housing.df.sel, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 4645584919
##                     % Var explained: 65.38
The ensemble comprises 500 individual decision trees, each considering 2 randomly selected variables at every split. The average squared difference between the predicted and observed values is 4645584919. The model explains approximately 65.38% of the variance in the response variable. This indicates a moderate to substantial level of explanatory power, showcasing the model’s ability to capture underlying patterns in the data. However, there’s still a portion of the variance unexplained by the model, which could be due to factors not included in the model or inherently unpredictable aspects of the data.
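
The two summary figures are linked: the percentage of variance explained is one minus the ratio of the mean squared residual to the variance of the response. The Python sketch below shows that relationship on made-up numbers. In the random forest output the residuals are based on out-of-bag predictions, which this toy example does not reproduce.

```python
import statistics


def mean_squared_residual(observed, predicted):
    """Average squared difference between observed and predicted values."""
    return statistics.fmean([(o - p) ** 2 for o, p in zip(observed, predicted)])


def pct_var_explained(observed, predicted):
    """100 * (1 - MSE / variance of the response), using the population variance."""
    mse = mean_squared_residual(observed, predicted)
    return 100 * (1 - mse / statistics.pvariance(observed))


observed = [100.0, 150.0, 200.0, 250.0]   # made-up response values
predicted = [110.0, 140.0, 210.0, 240.0]  # made-up predictions
print(mean_squared_residual(observed, predicted), pct_var_explained(observed, predicted))
```

Predicting the mean for every observation gives 0% by this measure, and perfect predictions give 100%.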

The variable importance plot for the random forest model illustrates the significance of different predictors in influencing the model’s predictions. This plot provides an overview of the relative importance of each variable, helping us identify key factors driving the model’s performance. It reveals that among the predictors, median_income produces the largest increase in mean squared error (%IncMSE) when its values are permuted, indicating its substantial impact on the model’s predictive accuracy. Conversely, households shows the smallest increase, suggesting a relatively lower influence on the model’s overall performance.
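
The idea behind permutation-based importance can be sketched in a few lines of Python: shuffle one predictor, which breaks its link to the response, and measure how much the mean squared error grows. This is a simplified illustration with a hypothetical stand-in model and made-up data. It omits details of the random forest version, such as working per tree on out-of-bag samples.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of `model` over rows given as tuples of predictor values."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, column, seed=0):
    """Increase in MSE after shuffling one column, which breaks its link to the target."""
    shuffled = [r[column] for r in rows]
    random.Random(seed).shuffle(shuffled)
    permuted = [r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, shuffled)]
    return mse(model, permuted, targets) - mse(model, rows, targets)


def toy_model(row):
    """Hypothetical stand-in model: depends on the first predictor only."""
    return 3 * row[0]


rows = [(float(i), float((i * 7) % 5)) for i in range(20)]  # made-up predictors
targets = [toy_model(r) for r in rows]
print(permutation_importance(toy_model, rows, targets, 0),
      permutation_importance(toy_model, rows, targets, 1))
```

A predictor the model relies on shows a clear increase in error when shuffled, while one the model ignores shows none.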
Plots of partial dependencies:

This partial dependence plot of ocean proximity explains how different levels of ocean_proximity influence the model’s predictions. Notably, the plot highlights that the category ‘Inland’ tends to have a lower score, while those categorized as ‘Near Bay’ and ‘Near Ocean’ exhibit higher scores.
The trend in population partial dependencies plot reveals a declining pattern from 0 to 2000, reaching its lowest point around 2000. However, the trend then slightly increases and stabilizes.

The partial dependence plot for housing median age indicates a consistent upward trend in predictions over the range from 13 to 50 years of housing age.

The partial dependence plot for median_income reveals the model’s response to variations in income levels. Notably, the trend shows an upward trajectory up to a median income of 10, after which it stabilizes.
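
A partial dependence curve is obtained by fixing one predictor at each value of a grid in every row of the data and averaging the model's predictions. The Python sketch below shows that computation. The model, data, and names are hypothetical stand-ins, not the fitted random forest.

```python
def partial_dependence(model, rows, column, grid):
    """For each grid value, set `column` to it in every row and average the predictions."""
    curve = []
    for value in grid:
        modified = [r[:column] + (value,) + r[column + 1:] for r in rows]
        curve.append(sum(model(r) for r in modified) / len(modified))
    return curve


def toy_model(row):
    """Hypothetical stand-in model: rises with the first predictor, then levels off at 10."""
    return 40_000 * min(row[0], 10.0) + 500 * row[1]


rows = [(2.0, 30.0), (4.5, 15.0), (8.0, 40.0)]  # made-up (median_income, housing_median_age)
print(partial_dependence(toy_model, rows, 0, [2.0, 6.0, 10.0, 14.0]))
```

Because the stand-in model is capped at an income of 10, the last two points of the curve coincide, mimicking the kind of plateau described above.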
Now, we will fine-tune the existing model using the expand.grid option to achieve more precise prediction results.
## 
## Call:
##  randomForest(formula = as.formula(paste(response, "~", paste(predictors,      collapse = "+"))), data = housing.df.sel, mtry = tune_params$mtry[i],      nodesize = tune_params$nodesize[i], ntree = 100) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 4610263821
##                     % Var explained: 65.64
In the fine-tuned model, the mean of squared residuals is 4610263821 and the percentage of variance explained is 65.64%. Model 2 outperforms Model 1 in terms of mean squared residuals, suggesting better predictive accuracy. Model 2 also has a slightly higher percentage of variance explained, indicating a better ability to account for the variability in the response variable. Overall, the differences in their performance metrics are relatively small.
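
The tuning step amounts to a grid search: every combination of candidate parameter values is evaluated and the best-scoring one is kept. The Python sketch below shows the skeleton of such a search. The parameter names `mtry` and `nodesize` come from the model call above, but the candidate values and the scoring function are hypothetical placeholders for fitting a forest and returning its mean squared residual.

```python
import itertools


def grid_search(fit_and_score, grid):
    """Evaluate every parameter combination; return (best_params, best_score), lower is better."""
    names = list(grid)
    best = None
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = fit_and_score(**params)
        if best is None or score < best[1]:
            best = (params, score)
    return best


def toy_score(mtry, nodesize):
    """Hypothetical placeholder for 'fit a forest and return its mean squared residual'."""
    return (mtry - 4) ** 2 + (nodesize - 5) ** 2


best_params, best_score = grid_search(toy_score, {"mtry": [2, 3, 4], "nodesize": [1, 5, 10]})
print(best_params, best_score)
```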

Summary

In this project on Machine Learning with Spatial Data, we analysed the California Housing dataset from the 1990 census, focusing on the complex relationship between housing characteristics, economic indicators, and spatial aspects. Utilizing a suite of R libraries, we preprocessed the data, ensuring clean and formatted input for analysis. Our approach included applying Geographically Weighted Regression (GWR) to understand spatially varying relationships and k-means clustering to discern patterns in geographical distribution.
A key aspect of our project was the spatial analysis, where we visualized and interpreted the distribution of median house values across California. Our findings highlighted a strong correlation between median income and house values, and a significant influence of ocean proximity on property prices. The use of Random Forest models provided deeper insights into variable importance and predictive capabilities, enhancing our understanding of how different factors impact housing values.
Our project not only showcased advanced analytical skills but also demonstrated the power of integrating machine learning with spatial data to extract meaningful insights from complex real estate trends.