1 Abstract

This statistical report explores the dynamics of California house prices, utilizing a modified version of the California Housing data set derived from the 1990 census data. Employing methodologies such as K-means clustering, Box-Cox transformations, and bootstrapping, this analysis aims to uncover nuanced patterns and relationships within this housing market.

The key predictors analyzed span geographical location, housing characteristics, demographic details, and economic indicators of US Census block groups (i.e., neighborhoods delineated by the Census Bureau, which serve as our observational units) and their inhabitants.

The study begins with exploratory data analysis, employing Box-Cox transformations to address issues of non-normality and heteroscedasticity in the data set. Subsequently, a bootstrapping approach is applied, first across cases and then across residuals, offering a robust assessment when we cannot rely on assumptions about the underlying distribution. Through these techniques, we seek to provide a more accurate understanding of the factors influencing house prices across California’s diverse census block groups.

2 Introduction

2.1 Motivation

The motivation behind undertaking this comprehensive data analysis of California house prices lies in the quest to unravel the intricate dynamics influencing one of the most pivotal aspects of the state’s economy – its real estate market. As housing affordability and market trends continue to be of paramount importance for residents, policymakers, and real estate professionals alike, there exists a compelling need to delve into the underlying factors that shape housing prices across diverse census block groups.

This analysis seeks to provide actionable insights for informed decision-making in various spheres. By applying advanced statistical techniques such as Box-Cox transformations and bootstrapping, we aim to not only mitigate data challenges but also enhance the robustness and accuracy of our findings. The significance of this study is amplified by its foundation in the 1990 California census data, offering a historical perspective that enriches our understanding of how socio-economic and spatial factors have shaped the housing landscape over time.

2.2 Data set Description

The data set was found on Kaggle. It includes information gathered from block groups across California during the 1990 Census. The block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data, typically consisting of 600 to 3,000 people. Each observation in the data set represents one block group, which contains an average of 1,425.5 individuals living in a geographically compact area.

The data set contains 20640 observations of 9 independent (predictor) variables and one dependent variable, median house value. The link to the raw CSV file is posted on GitHub: https://raw.githubusercontent.com/JZhong01/STA321/main/Topic%203/ca-housing-price.csv.

  • Longitude(X1): How far west the block group is located, measured from the prime meridian
  • Latitude(X2): How far north the block group is located, measured from the equator
  • HousingMedianAge(X3): Median age of the houses within the block group
  • TotalRooms(X4): Total number of rooms in the block group
  • TotalBedrooms(X5): Total number of bedrooms in the block group
  • Population(X6): Total number of residents in the block group
  • Households(X7): Total number of households, defined as a group of people residing within a home unit
  • MedianIncome(X8): Median income, in tens of thousands of US Dollars, for households within the block group
  • MedianHouseValue(Y): Median house value, in US Dollars, within the block group
  • OceanProximity(X9): Location of the block group relative to the ocean or bay, recorded as a categorical variable

2.3 Research Question

The primary objective of this analysis is to identify the primary factors that influence house prices amongst California census block groups.

Practical questions that this research aims to explore are as follows:

  • What socioeconomic factors exhibit the largest correlation with median house price?
  • What transformations, if any, need to be applied to the data?

This analysis will seek to explore 2 hypotheses:

  • Hypothesis 1: A reduced model will perform better than the full model.

  • Hypothesis 2: Distance to the ocean is negatively correlated with median house value.

3 Methods

This section outlines the key methodologies employed in the analysis, including multiple linear regression, K-means clustering, Box-Cox transformation, and bootstrapping. Each method is designed to address specific research goals and improve model robustness.

3.1 Multiple Linear Regression

Multiple linear regression (MLR) serves as the foundational model for examining how multiple predictors, such as housing characteristics and demographic factors, influence California house prices. In this analysis, MLR allows us to quantify the relationship between the dependent variable, MedianHouseValue, and several independent variables including median income, population, and geographic location. The initial model will include all predictors to provide a baseline understanding before refining the model based on diagnostics. Key assumptions include linearity of relationships, independence of observations, homoscedasticity (constant variance of residuals), and normally distributed residuals.
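
In general form, with p predictors, the model for block group $i$ can be written as

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2),$$

where $Y_i$ is the median house value, $X_{i1}, \dots, X_{ip}$ are the predictors, and the error terms $\varepsilon_i$ are assumed independent with constant variance.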

3.2 K-means Clustering

To capture the geographical variation of latitude and longitude in a more interpretable and meaningful way, K-means clustering is employed. This method groups California’s block groups into distinct geographic regions, which are used as a new categorical variable, GeoCluster. This clustering technique allows for latitude and longitude to be useful predictors, as considering them as separate predictors is unreliable. An elbow plot is employed to determine the optimal number of clusters. The main assumption is that geographic proximity (as measured by distance) meaningfully influences house prices, and that the number of clusters adequately represents regional differences.

3.3 Box-Cox Transformation

Box-Cox transformation is employed to address violations of regression assumptions, particularly non-normality and heteroscedasticity of residuals. By transforming the dependent variable, MedianHouseValue, the Box-Cox method helps stabilize variance and approximate a normal distribution of residuals, improving the overall fit of the regression model. This transformation assumes that the relationship between the predictors and the dependent variable can be linearized and that the response variable is strictly positive, which holds true in this dataset.
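
As a brief sketch of how the transformation parameter could be estimated in R (using MASS::boxcox on the same model formula used in Section 5; the search grid for $\lambda$ is an assumption, and the transformation is $y^{(\lambda)} = (y^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$ and $\log y$ for $\lambda = 0$):

library(MASS)  # boxcox()

# Profile the Box-Cox log-likelihood over a grid of lambda values
bc <- boxcox(median_house_value ~ . - longitude - latitude,
             data = cahouses, lambda = seq(-1, 1, by = 0.05))

# Lambda that maximizes the profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat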

3.4 Bootstrap Residuals

Bootstrapping is used to generate robust confidence intervals for the regression coefficients, especially when the assumptions of normality for residuals are violated. By resampling the dataset 1,000 times with replacement, we estimate the variability in the coefficients without relying on parametric assumptions. Bootstrapping is a flexible, non-parametric approach, with the key assumption being that the sample is representative of the population, and resampling can approximate the true sampling distribution of the model estimates.
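
A minimal sketch of the case (row) bootstrap described above, assuming the model formula used in Section 5 and B = 1,000 replicates; the seed and the percentile-interval choice are illustrative:

set.seed(321)  # illustrative seed for reproducibility
B <- 1000

# Refit the model on B resampled data sets and collect the coefficients
boot_coefs <- replicate(B, {
  idx <- sample(nrow(cahouses), replace = TRUE)   # resample block groups (cases)
  coef(lm(median_house_value ~ . - longitude - latitude, data = cahouses[idx, ]))
})

# 95% percentile bootstrap confidence intervals for each coefficient
# (na.rm guards against resamples that happen to miss a rare factor level)
apply(boot_coefs, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE)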

3.5 Other Methods

In addition to the core methods listed above, other statistical techniques may be employed, depending on the results of the initial analyses:

  1. Variance Inflation Factor (VIF): This will be used to detect multicollinearity in the multiple regression models.

  2. Residual Diagnostics: Standard diagnostic plots (Q-Q plot, residuals vs. fitted, etc.) will be used to assess the validity of assumptions in the regression model.

  3. Cross-Validation: If beneficial, cross-validation may be used to validate the robustness of the final model and prevent overfitting.

4 Exploratory Data Analysis

The original data set contains 9 independent variables and 1 dependent variable. Before proceeding with model building, it is important to explore the relationships among the variables, identify any patterns, and address potential issues like multicollinearity or irrelevant variables.

4.1 Data Characteristics

These data were originally published in the 1997 paper titled Sparse Spatial Autoregressions by Pace et al. Each observation represents one census block group from the 1990 US Census conducted in California.

There is only one response variable - the median house value of the block group. There are 9 predictor variables consisting of 8 numerical variables and one nominal variable. These numerical variables describe geographic location (longitude, latitude), house characteristics (median house age, total rooms, total bedrooms), resident characteristics (population, households), and socioeconomic characteristics (median household income). The categorical variable very roughly describes 5 categories of relation to the ocean (<1 hour away, inland, near ocean, near bay, or island). The summary statistics are as follows:



cahouses %>%
  summary() %>%
  t() %>%
  kable(format = "markdown", caption = "Summary of Data Set's Statistics")
Summary of Data Set’s Statistics

| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
|---|---|---|---|---|---|---|---|
| longitude | -124.3 | -121.8 | -118.5 | -119.6 | -118.0 | -114.3 | 0 |
| latitude | 32.54 | 33.93 | 34.26 | 35.63 | 37.71 | 41.95 | 0 |
| housing_median_age | 1.00 | 18.00 | 29.00 | 28.64 | 37.00 | 52.00 | 0 |
| total_rooms | 2 | 1448 | 2127 | 2636 | 3148 | 39320 | 0 |
| total_bedrooms | 1.0 | 296.0 | 435.0 | 537.9 | 647.0 | 6445.0 | 207 |
| population | 3 | 787 | 1166 | 1425 | 1725 | 35682 | 0 |
| households | 1.0 | 280.0 | 409.0 | 499.5 | 605.0 | 6082.0 | 0 |
| median_income | 0.4999 | 2.5634 | 3.5348 | 3.8707 | 4.7432 | 15.0001 | 0 |
| median_house_value | 14999 | 119600 | 179700 | 206856 | 264725 | 500001 | 0 |

The categorical variable ocean_proximity is stored as a character vector of length 20640 and is therefore not summarized numerically.

4.1.1 Handling Missing Values

There are 207 missing values for total_bedrooms. As such, these observations will be removed. While it’s normally not advised to remove observations due to concerns of losing information from the data, there are 20640 observations, meaning that removing 207 observations will have a very minor impact. The benefits of removing the observations with missing values outweigh the drawbacks.

  cahouses <- na.omit(cahouses)  # total_bedrooms is the only variable with missing values
  sum(is.na(cahouses))
## [1] 0

4.1.2 Standardizing Income

There is a discrepancy in units for median income of households and median house value. The units for median income are in tens of thousands of US Dollars whereas the units for median house value are in US Dollars. As such, median income is multiplied by a factor of 10000 so that the units are standardized to US Dollars.

# conversion of income to USD
  cahouses$median_income <- cahouses$median_income * 10000
  summary(cahouses$median_income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4999   25637   35365   38712   47440  150001

4.2 Geographic Factors

Longitude and latitude are presented in the data set as 2 separate, standalone variables. However, they should not be used in this way because they do not inherently convey meaningful geographic information when split. Instead, exploratory analysis is conducted to detect whether any patterns occur when both longitude and latitude are used in conjunction. This information can then be used to derive a new categorical variable.

First, a scatter plot of the observations’ longitude and latitude is created to see if patterns emerge.


ggplot(cahouses, aes(x = longitude, y = latitude)) +
  geom_point(alpha = 0.4, color = 'blue') +
  labs(title = "Scatter Plot of Longitude and Latitude for California Housing Data", 
       x = "Longitude", y = "Latitude") +
  theme_minimal()

This scatter plot helps us visualize our geographic data. The observations appear to fall into distinct groups corresponding to different parts of California, suggesting that the housing data could be sorted into geographic clusters.

4.2.1 Creating Geographic Clusters

Before clustering longitude and latitude, we’ll create an elbow plot to help determine the optimal number of clusters.

The plot graphs the total within-cluster sum of squares (WCSS), which measures the compactness of the clusters. As the number of clusters increases, the WCSS decreases because the data points lie closer to the centroids of their respective clusters. However, after a certain point (the “elbow”), adding more clusters yields only marginal improvement, so the elbow marks a reasonable choice for the number of clusters. Finding this elbow allows us to choose a K that appropriately balances model complexity and cluster compactness.
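
Formally, for $K$ clusters $C_1, \dots, C_K$ with centroids $\mu_1, \dots, \mu_K$, the quantity plotted is

$$\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,$$

which is the value that kmeans() reports as tot.withinss.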


# Subset the data to include only latitude and longitude
geo_data <- cahouses[, c('longitude', 'latitude')]

# Set a range of potential K values (number of clusters)
wcss <- numeric()  # Initialize an empty vector for within-cluster sum of squares

# Calculate WCSS for different numbers of clusters
set.seed(123)  # fix the random starts so the WCSS curve is reproducible
for (k in 1:10) {
  kmeans_result <- kmeans(geo_data, centers = k)
  wcss[k] <- kmeans_result$tot.withinss  # Store the total WCSS for each K
}

# Create the elbow plot
elbow_plot <- data.frame(K = 1:10, WCSS = wcss)

ggplot(elbow_plot, aes(x = K, y = WCSS)) +
  geom_line(color = "blue", size = 1) +
  geom_point(color = "red", size = 3) +
  labs(title = "Elbow Plot for Optimal Number of Clusters", 
       x = "Number of Clusters (K)", 
       y = "Within-Cluster Sum of Squares (WCSS)") +
  scale_x_continuous(breaks = seq(1, 10, by = 1)) +  # Ensure whole numbers on x-axis
  ylim(0, 50000) +  # Adjust to fit WCSS scale better
  theme_minimal()

The point for k = 1 is not shown in the elbow plot because its WCSS exceeds 175,000, which would compress the rest of the curve. In addition, while the drop in WCSS from k = 1 to k = 2 is much larger than the drop from k = 2 to k = 3, choosing k = 2 clusters may be an oversimplification and may mask smaller, yet meaningful, geographic variation.

Based on the elbow plot, k = 4 clusters are used for the K-means clustering.


# Clustering longitude and latitude into regions using k-means
set.seed(123) # Setting a seed for reproducibility
kmeans_result <- kmeans(cahouses[, c('longitude', 'latitude')], centers = 4)

# Add the cluster assignment as a new categorical variable
cahouses$GeoCluster <- as.factor(kmeans_result$cluster)

# Summarize the clustering result
table(cahouses$GeoCluster)
## 
##     1     2     3     4 
##  1771  6743 10913  1006

# Plot the geographic clusters
ggplot(cahouses, aes(x = longitude, y = latitude, color = GeoCluster)) +
  geom_point(alpha = 0.4) +
  labs(title = "K-Means Clustering of Longitude and Latitude", 
       x = "Longitude", y = "Latitude") +
  theme_minimal()

Now, GeoCluster is used as a categorical variable in place of longitude and latitude. This transformation captures geographic patterns and improves interpretability. Longitude and latitude are dropped from further analysis.

4.2.2 Ocean proximity

OceanProximity describes how close the block group is to the ocean. This is a critical factor when considering housing prices due to its strong association with real estate value.

Histograms for median house value across different levels of OceanProximity are plotted to visually investigate for any meaningful distributional differences.

# Plot median house value separately for each level of ocean_proximity
par(mfrow = c(2, 3))
for (lvl in sort(unique(cahouses$ocean_proximity))) {
  hist(cahouses$median_house_value[cahouses$ocean_proximity == lvl],
       xlab = "Median House Value", main = lvl)
}

Upon reviewing these histograms, all of the distributions display a distinct right skew. To test whether mean median house value differs across the factor levels of OceanProximity, a one-way ANOVA is run.

anova_model <- aov(median_house_value ~ ocean_proximity, data = cahouses)
summary(anova_model)
##                    Df    Sum Sq   Mean Sq F value Pr(>F)    
## ocean_proximity     4 6.479e+13 1.620e+13    1595 <2e-16 ***
## Residuals       20428 2.075e+14 1.016e+10                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA resulted in F(4, 20428) = 1595, p < 0.001, meaning that at least one factor level's mean differs from the others. As a result, OceanProximity remains relevant to the study at hand. We'll perform post-hoc analyses later when the multiple regression model is built; a sketch of one such comparison is shown below.
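
For reference, a sketch of the kind of post-hoc comparison we have in mind (output omitted here) can be run directly on the fitted ANOVA object:

# Pairwise comparisons of mean median house value between ocean_proximity levels
TukeyHSD(anova_model, "ocean_proximity")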

4.2.3 Chi-Squared Test between GeoCluster and OceanProximity

To ensure that the GeoCluster and OceanProximity variables are not confounding each other, a Chi-Squared Test of Independence is run. This test will check if there is an association between the two categorical variables. If the p-value is significant, we will need to assess the impact of using both variables in the model.

geo_ocean_table <- table(cahouses$GeoCluster, cahouses$ocean_proximity)
chisq.test(geo_ocean_table)
## 
##  Pearson's Chi-squared test
## 
## data:  geo_ocean_table
## X-squared = 9684.5, df = 12, p-value < 2.2e-16

Since χ²(12) = 9684.5, p < 0.001 for the chi-squared test of independence, there appears to be an association between geographic cluster and ocean proximity. This concern will be addressed when candidate models are built.

4.3 Socioeconomic Factors

Socioeconomic factors such as population distribution, economic status, and household distributions can significantly influence median house values and will be considered as part of this study.

4.3.1 Median Income

A histogram of median income distribution of the 20433 observations is created.

# income distribution
hist(cahouses$median_income, xlab = "Median Income", main = "Median Income Distribution")

In the distribution for median income, each bar represents an increment of $10,000. The first bar shows that roughly 100 block groups have a median income between $0 and $9,999. The distribution is right-skewed, with fewer than 500 block groups showing a median income of $100,000 or more.

However, we decided not to discretize median income further, as binning can reduce statistical power and cause information loss. Furthermore, the maximum median income observed was $150,001, and 48 block groups share this exact value, which suggests that incomes above $150,000 were already capped (top-coded) in the original data set.

#proof income was pre-discretized

  sum(cahouses$median_income == 150001)
## [1] 48

  tail(sort(cahouses$median_income), 50)
##  [1] 150000 150000 150001 150001 150001 150001 150001 150001 150001 150001
## [11] 150001 150001 150001 150001 150001 150001 150001 150001 150001 150001
## [21] 150001 150001 150001 150001 150001 150001 150001 150001 150001 150001
## [31] 150001 150001 150001 150001 150001 150001 150001 150001 150001 150001
## [41] 150001 150001 150001 150001 150001 150001 150001 150001 150001 150001

4.3.2 Inhabitant Characteristics

The distributions for population and households are plotted below.


#Population distribution
par(mfrow = c(1,2))
hist(cahouses$population, xlab = "Population Count", xlim = c(0, 8000), main = "Population Distribution" ) 
text(x = 5000, y = 15000, labels = "60 Observations >= 8000 ", col = "red", cex = 0.8, pos = 1)
text(x = 5000, y = 14000, labels = "are cut off", col = "red", cex = 0.8, pos = 1)

sum(cahouses$population >= 8000)
## [1] 60

#Household distribution
hist(cahouses$households, xlab = "Household Count", xlim = c(0, 3000), main = "Household Distribution" )
text(x = 1800, y = 11000, labels = "47 Observations >= 3000 ", col = "red", cex = 0.8, pos = 1)
text(x = 1800, y = 10000, labels = "are cut off", col = "red", cex = 0.8, pos = 1)


sum(cahouses$households >= 3000)
## [1] 47

The histograms for population and household counts display right-skewed distributions, with a high frequency of block groups that have smaller populations and household counts. Densely populated urban and suburban areas contribute to this skew because they are divided into many geographically small block groups, concentrating the distribution at smaller population and household counts. Consequently, it is likely that a transformation will be required to correct these skews before proceeding with the analysis.

4.4 House Characteristics

4.4.1 Median House Age

The distribution of housing median age is shown below.


# house age distribution
hist(cahouses$housing_median_age, xlab = "Age of house", main = "Median House Age Distribution" )



sum(cahouses$housing_median_age == 52)
## [1] 1265

The distribution for housing median age is approximately normal, except for the last bar, which represents a large spike at 52 years old. This suggests that the data for housing age may have been discretized or capped at 52 years, as 1,265 observations are recorded at this exact age. Nonetheless, since normality is largely satisfied, no transformation is required for housing age.

4.4.2 Rooms and Bedrooms

Next, the distributions of total rooms and total bedrooms per block group are displayed.


# Total room distribution
par(mfrow = c(1,2))
hist(cahouses$total_rooms, xlab = "Room Count", xlim = c(0,15000), main = "Total Rooms Distribution" )
text(x = 10000, y = 7500, labels = "97 Observations >= 15000 ", col = "red", cex = 0.8, pos = 1)
text(x = 10000, y = 6800, labels = "are cut off", col = "red", cex = 0.8, pos = 1)

sum(cahouses$total_rooms >= 15000)
## [1] 97


# Total bedroom distribution
hist(cahouses$total_bedrooms, xlab = "Bedroom Count", xlim = c(0, 3000), main = "Total Bedrooms Distribution" )
text(x = 2000, y = 10000, labels = "67 Observations >= 3000 ", col = "red", cex = 0.8, pos = 1)
text(x = 2000, y = 9100, labels = "are cut off", col = "red", cex = 0.8, pos = 1)


sum(cahouses$total_bedrooms >= 3000)
## [1] 67

Both the total rooms and total bedrooms distributions exhibit heavy right-skewed patterns. This is expected, as apartment complexes or smaller homes in densely populated areas tend to contribute lower counts of rooms and bedrooms per block group. Given the skewness, a transformation will likely be required to normalize these variables and improve the model’s performance.

4.5 Multicollinearity and Variable Selection

Multicollinearity occurs when two or more predictor variables in a model are highly correlated, which can obscure the true relationship between each predictor and the response variable. A key step before building the regression model is to check for multicollinearity among the numerical variables.

4.5.1 Pairwise Scatterplot

A pairwise scatterplot is used to visually inspect the relationships between numerical variables and check for highly correlated pairs. The correlation coefficients between the variables are shown on the plot, and higher correlations indicate potential issues with multicollinearity.


# pairs.panels() is from the psych package; drop longitude, latitude, median_house_value, ocean_proximity, and GeoCluster
pairs.panels(cahouses[, -c(1, 2, 9, 10, 11)], pch=21, main="Pair-wise Scatter Plot of Numerical Variables")

Visually, the pairwise scatterplot reveals a few concerning cases of collinearity. For instance, the correlation between the number of rooms and the number of bedrooms is extremely high (0.98), suggesting these two variables are capturing very similar information. Other large correlations exist between population and households (0.91), and rooms with households (0.92). These strong correlations indicate multicollinearity, which may distort our regression results. In the next section, we formally test for multicollinearity using the Variance Inflation Factor (VIF).

4.5.2 Variance Inflation Factor (VIF) Check

To formally detect multicollinearity, Variance Inflation Factors (VIF) are calculated for the predictors in the initial model. A VIF value greater than 10 typically indicates a high level of multicollinearity, which could affect the stability of the regression coefficients. However, it is important to note that for categorical variables transformed into dummy variables, high VIFs may not always be a cause for concern due to how reference categories are selected.
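
For a numerical predictor $X_j$, the VIF is

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$

where $R_j^2$ is the coefficient of determination from regressing $X_j$ on all of the other predictors. For factor terms with more than one degree of freedom, the car package reports a generalized VIF (GVIF), which is what appears in the output below.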


library(car)  # provides vif() / generalized VIF

full_model <- lm(median_house_value ~ . - longitude - latitude, data = cahouses)
vif(full_model)
##                         GVIF Df GVIF^(1/(2*Df))
## housing_median_age  1.322156  1        1.149850
## total_rooms        12.974415  1        3.602001
## total_bedrooms     36.052098  1        6.004340
## population          6.447227  1        2.539139
## households         34.571458  1        5.879750
## median_income       1.809616  1        1.345220
## ocean_proximity     2.190419  4        1.102976
## GeoCluster          1.870925  3        1.110050

In examining the VIF values, Total Rooms (VIF = 12.97), Total Bedrooms (VIF = 36.05), and Households (VIF = 34.57) exhibit extremely high multicollinearity, which is consistent with our earlier visual inspection of the pairwise scatterplots. These high VIF values suggest that these variables provide overlapping information, and it may be beneficial to consider reducing the number of highly correlated variables to improve the model.

However, it is worth noting that GeoCluster and OceanProximity, which are categorical variables, have moderate VIF values that are not as concerning. High VIFs for dummy variables often arise due to small sample sizes in reference categories. In this case, while the VIF for OceanProximity is 2.19 and GeoCluster is 1.87, these values are not large enough to warrant concern, especially given that the categorical variables are dummy coded.

We will continue to address these multicollinearity issues in our variable selection and modeling steps. Specifically, it may be necessary to drop or transform certain numerical variables (such as Total Bedrooms and Households) to improve model stability, while recognizing that categorical variables may inherently have higher VIF values without seriously affecting the overall model fit.

5 Regression Modeling

After performing the exploratory data analysis and preparing the data, the next step is to develop a regression model. The aim is to build a model that not only meets statistical requirements, such as satisfying assumptions about linearity, normality, and homoscedasticity, but also provides valuable practical insights for decision-making.

Categorical variables with more than two factor levels result in the creation of dummy variables, one for each non-baseline level. This means the coefficient for each dummy variable compares that category to its baseline (reference) category. Our two categorical variables, ocean_proximity and GeoCluster, have baselines of <1 hour from the ocean and GeoCluster1, respectively.

For instance, if ocean_proximityINLAND has a negative coefficient, it signifies that inland homes have lower median values compared to those within <1 hour from the ocean, suggesting that coastal proximity substantially increases property value. Similarly, a positive coefficient for GeoCluster2 indicates that block groups in that region have higher median house values than those in GeoCluster1, implying that the region is more desirable or has higher property demand.
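
The baseline levels can be verified directly in R (a quick sketch; the commented relevel() call is illustrative only, showing how the reference category could be changed if a different comparison were preferred):

# The first factor level is the baseline that the dummy variables compare against
levels(factor(cahouses$ocean_proximity))   # "<1H OCEAN" sorts first, so it is the baseline
levels(cahouses$GeoCluster)                # cluster "1" is the baseline

# Illustrative only: make INLAND the reference category instead
# cahouses$ocean_proximity <- relevel(factor(cahouses$ocean_proximity), ref = "INLAND")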

5.1 Full Model

We begin by fitting a full model with all predictors (excluding longitude and latitude, which were previously transformed into geographic clusters). This model will serve as the baseline for comparison.


full_model <- lm(median_house_value ~ . - longitude - latitude, data = cahouses)

kable(round(summary(full_model)$coef, 3), caption = "Coefficient Matrix for Full Linear Model") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F) %>%
  column_spec(1:5, width = "4cm")
Coefficient Matrix for Full Linear Model

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 18853.505 | 2912.032 | 6.474 | 0.000 |
| housing_median_age | 1135.373 | 44.299 | 25.630 | 0.000 |
| total_rooms | -6.722 | 0.800 | -8.407 | 0.000 |
| total_bedrooms | 79.902 | 6.912 | 11.559 | 0.000 |
| population | -37.601 | 1.087 | -34.593 | 0.000 |
| households | 73.829 | 7.461 | 9.895 | 0.000 |
| median_income | 3.947 | 0.034 | 114.885 | 0.000 |
| ocean_proximityINLAND | -66400.322 | 1375.194 | -48.284 | 0.000 |
| ocean_proximityISLAND | 173614.592 | 31036.617 | 5.594 | 0.000 |
| ocean_proximityNEAR BAY | -4257.368 | 2003.783 | -2.125 | 0.034 |
| ocean_proximityNEAR OCEAN | 13699.215 | 1561.299 | 8.774 | 0.000 |
| GeoCluster2 | 22843.258 | 1982.686 | 11.521 | 0.000 |
| GeoCluster3 | 13191.461 | 1929.395 | 6.837 | 0.000 |
| GeoCluster4 | -13359.336 | 2749.640 | -4.859 | 0.000 |

5.1.1 Full Model Coefficient Interpretation

The coefficients below summarize the relationship between predictor variables and median house value:

  • Housing Median Age: Each additional year of housing median age increases house value by $1,135, likely due to the appeal of established areas.

  • Total Rooms: For each additional room, house value decreases by $6.72. This appears to be paradoxical, but it is likely the result of the collinearity between Total Rooms and Total Bedrooms.

  • Total Bedrooms: Each additional bedroom increases house value by $79.90, reflecting that bedrooms are more valuable than overall room count.

  • Population: For each additional person in the population, house value decreases by $37.60. More populated areas tend to have lower house values, possibly due to crowding or lower space per household.

  • Households: Each additional household increases house value by $73.83, suggesting that higher demand for housing raises prices.

  • Median Income: For every 1 dollar increase in median income, house value increases by 3.95 dollars. Higher-income areas see higher home values.

  • Ocean Proximity: The dummy variable represents the proximity of houses to the ocean.

    • INLAND: Inland properties are associated with a $66,400 decrease in value compared to properties within <1 hour from the ocean.
    • ISLAND: Island properties are associated with a $173,615 increase in value compared to properties within <1 hour from the ocean.
    • NEAR BAY: Properties near the bay are associated with a $4,257 decrease in value compared to those <1 hour from the ocean.
    • NEAR OCEAN: Properties near the ocean see a $13,699 increase in value compared to those <1 hour from the ocean.
  • GeoCluster: The GeoCluster categorical variable captures geographic patterns and their relationship to house value.

    • GeoCluster2: Properties in this cluster are associated with a $22,843 increase in value compared to GeoCluster1.
    • GeoCluster3: Properties in this cluster see a $13,191 increase in value compared to GeoCluster1.
    • GeoCluster4: Properties in this cluster have a $13,359 decrease in value compared to GeoCluster1.

5.1.2 Full Model Residual Analysis

To assess the model, we check for any violations of regression assumptions by examining residual plots:

par(mfrow = c(2,2))
plot(full_model)

  • Residuals vs Fitted: There’s a clear non-linearity, especially for higher fitted values. The relationship between predictors and house values may not be fully captured by a linear model.

  • Normal Q-Q: Deviations from normality in the tails suggest that some residuals, particularly from influential points like observations 15361 and 18502, are outliers. This affects the model’s validity for hypothesis testing.

  • Scale-Location: The increasing spread of residuals with higher fitted values suggests heteroscedasticity, indicating non-constant variance. A transformation of the response variable could help address this.

  • Residuals vs Leverage: Influential points such as 15361 and 18502 have high leverage and may be disproportionately impacting the model. These points require further investigation, and removing them could provide more stable results.

To address these violations, transformations such as a Box-Cox or log transformation of the response variable should be considered. Additionally, handling influential points and reducing multicollinearity can improve model stability and predictive performance.

5.2 Reduced Model

The primary goal of developing a reduced model is to alleviate the multicollinearity present in the full model, as identified through the high VIF values for certain predictors. Reducing multicollinearity ensures more stable and interpretable regression coefficients, leading to a model that balances simplicity and predictive performance.

Additionally, we aim to create a parsimonious model, which is a model that explains the data with the fewest possible variables without sacrificing explanatory power. This is important because a simpler model is easier to interpret and often avoids overfitting, while still capturing the significant relationships between predictors and house values.

To address multicollinearity, we will iteratively remove the following variables:

  • Total Rooms: Has high VIF and could be collinear with Total Bedrooms. Total Bedrooms typically is a better indicator of house value.
  • Population: Another variable with potential collinearity with Households.
  • Households: Has collinearity with population and total rooms.
  • A combination of the three: The likely reduced model with low multicollinearity is one where multiple variables are taken out.

We will then compare the goodness-of-fit measures to determine which reduced model performs best.

5.2.1 Reduced Model Criterion

We create four reduced models by removing different combinations of the identified variables.


# Reduced Model 1: Remove Total Rooms
r_model1 <- lm(median_house_value ~ . - total_rooms - longitude - latitude, data = cahouses)

# Reduced Model 2: Remove Population
r_model2 <- lm(median_house_value ~ . - population - longitude - latitude, data = cahouses)

# Reduced Model 3: Remove Both Total Rooms and Population
r_model3 <- lm(median_house_value ~ . - total_rooms - population - longitude - latitude, data = cahouses)

# Reduced Model 4: All 3 variables
r_model4 <- lm(median_house_value ~ . - total_rooms - population - households - longitude - latitude, data = cahouses)

Next, we compare the goodness-of-fit metrics across the models to identify which one offers the best balance of model fit and simplicity.


select=function(m){ # m is an object: model
 e = m$resid                           # residuals
 n0 = length(e)                        # sample size
 SSE=(m$df)*(summary(m)$sigma)^2       # sum of squared errors
 R.sq=summary(m)$r.squared             # coefficient of determination: R square
 R.adj=summary(m)$adj.r.squared        # adjusted R square
 MSE=(summary(m)$sigma)^2              # mean squared error
 Cp=(SSE/MSE)-(n0-2*(n0-m$df))         # Mallows' Cp
 AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df)          # Akaike information criterion
 SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df)  # Schwarz Bayesian information criterion
 d=e/(1-hatvalues(m))                  # deleted (leave-one-out) residuals via leverages
 PRESS=t(d)%*%d   # predicted residual error sum of squares (PRESS) - a cross-validation measure
 tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
 names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
 tbl = round(tbl, 2)
 tbl
 }



output <- rbind(select(r_model1), select(r_model2), select(r_model3), select(r_model4))

row.names(output) <- c("Total Rooms Removed", "Population Removed", "Total Rooms and Population Removed", "All 3 Removed")


kable(round(output, 3), caption = "Goodness-of-fit Measures of Candidate Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F) %>%
  column_spec(1:5, width = "4cm")
Goodness-of-fit Measures of Candidate Models

| Model | SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS |
|---|---|---|---|---|---|---|---|
| Total Rooms Removed | 9.851925e+13 | 0.64 | 0.64 | 13 | 455607.7 | 455710.7 | 9.888029e+13 |
| Population Removed | 1.039333e+14 | 0.62 | 0.62 | 13 | 456700.8 | 456803.8 | 1.040939e+14 |
| Total Rooms and Population Removed | 1.056478e+14 | 0.61 | 0.61 | 12 | 457033.1 | 457128.2 | 1.057947e+14 |
| All 3 Removed | 1.061426e+14 | 0.61 | 0.61 | 11 | 457126.6 | 457213.7 | 1.062716e+14 |

vif(r_model1)
##                         GVIF Df GVIF^(1/(2*Df))
## housing_median_age  1.319456  1        1.148676
## total_bedrooms     26.725090  1        5.169632
## population          5.938604  1        2.436925
## households         34.460559  1        5.870312
## median_income       1.149373  1        1.072088
## ocean_proximity     2.088842  4        1.096448
## GeoCluster          1.852732  3        1.108244

vif(r_model2)
##                         GVIF Df GVIF^(1/(2*Df))
## housing_median_age  1.320976  1        1.149337
## total_rooms        11.950860  1        3.457002
## total_bedrooms     33.556875  1        5.792830
## households         26.158402  1        5.114529
## median_income       1.741332  1        1.319595
## ocean_proximity     2.169921  4        1.101680
## GeoCluster          1.849542  3        1.107926

vif(r_model3)
##                         GVIF Df GVIF^(1/(2*Df))
## housing_median_age  1.316856  1        1.147544
## total_bedrooms     26.159420  1        5.114628
## households         25.906529  1        5.089846
## median_income       1.148184  1        1.071534
## ocean_proximity     2.064508  4        1.094844
## GeoCluster          1.839177  3        1.106889

vif(r_model4)
##                        GVIF Df GVIF^(1/(2*Df))
## housing_median_age 1.315232  1        1.146836
## total_bedrooms     1.151649  1        1.073149
## median_income      1.143169  1        1.069191
## ocean_proximity    2.025960  4        1.092267
## GeoCluster         1.829524  3        1.105918


Even after these removals, the VIF values for total_bedrooms and households remain very large in the first three reduced models; only the model with all three variables removed brings every VIF below 10, and it also has the weakest goodness-of-fit measures (largest SSE and PRESS). Since manual reduction alone does not yield a clearly superior candidate model, we move to automatic variable selection to identify our reduced models.

5.3 Stepwise AIC

We begin by applying Stepwise AIC selection to reduce the model’s complexity and improve interpretability. StepAIC minimizes the Akaike Information Criterion (AIC), which balances model fit and complexity. This method iteratively adds or removes variables, choosing the model with the lowest AIC.


# Fit the full model
full_model <- lm(median_house_value ~ . - longitude - latitude, data = cahouses)

# Apply stepwise selection based on AIC
step_model <- stepAIC(full_model, direction = "both", trace = FALSE)

# Check the summary of the stepwise-selected model
kable(round(summary(step_model)$coef, 3), caption = "Coefficient Matrix for StepAIC-Selected Model") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F) %>%
  column_spec(1:5, width = "4cm")
Coefficient Matrix for StepAIC-Selected Model

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 18853.505 | 2912.032 | 6.474 | 0.000 |
| housing_median_age | 1135.373 | 44.299 | 25.630 | 0.000 |
| total_rooms | -6.722 | 0.800 | -8.407 | 0.000 |
| total_bedrooms | 79.902 | 6.912 | 11.559 | 0.000 |
| population | -37.601 | 1.087 | -34.593 | 0.000 |
| households | 73.829 | 7.461 | 9.895 | 0.000 |
| median_income | 3.947 | 0.034 | 114.885 | 0.000 |
| ocean_proximityINLAND | -66400.322 | 1375.194 | -48.284 | 0.000 |
| ocean_proximityISLAND | 173614.592 | 31036.617 | 5.594 | 0.000 |
| ocean_proximityNEAR BAY | -4257.368 | 2003.783 | -2.125 | 0.034 |
| ocean_proximityNEAR OCEAN | 13699.215 | 1561.299 | 8.774 | 0.000 |
| GeoCluster2 | 22843.258 | 1982.686 | 11.521 | 0.000 |
| GeoCluster3 | 13191.461 | 1929.395 | 6.837 | 0.000 |
| GeoCluster4 | -13359.336 | 2749.640 | -4.859 | 0.000 |

The stepwise regression removed none of the variables, making the selected model identical to the full model; among the models considered, the full model already has the lowest AIC. However, because this model still suffers from multicollinearity, we proceed to a different method of automatic variable selection.

5.4 LASSO Regression

Lasso (Least Absolute Shrinkage and Selection Operator) is a regularization technique that penalizes the absolute size of regression coefficients, shrinking less important coefficients to zero and effectively performing variable selection. This method is particularly useful when dealing with multicollinearity, and it helps create simpler, more interpretable models by automatically selecting the most relevant predictors.
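
Formally (up to the scaling conventions used by glmnet), the lasso coefficients minimize the penalized residual sum of squares

$$\hat{\beta} = \underset{\beta_0,\, \beta}{\arg\min} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\},$$

where the tuning parameter $\lambda$ controls the strength of the shrinkage and is chosen by cross-validation below.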

We apply Lasso Regression to identify the most important variables and shrink the less relevant ones. The final model will only include predictors that contribute meaningfully to predicting median house value.


library(glmnet)

# Prepare the data for glmnet (requires matrix for predictors)
X <- model.matrix(median_house_value ~ . - longitude - latitude, cahouses)[,-1]
Y <- cahouses$median_house_value

# Fit Lasso model with cross-validation to find the best lambda
lasso_model <- cv.glmnet(X, Y, alpha = 1)

# Plot the cross-validated mean squared error for different values of lambda
plot(lasso_model)


# Get the best lambda that minimizes the cross-validated error
best_lambda <- lasso_model$lambda.min

# Fit the final Lasso model using the best lambda
final_lasso <- glmnet(X, Y, alpha = 1, lambda = best_lambda)

# Extract the coefficients of the selected variables
lasso_coefs <- coef(final_lasso)
lasso_df <- as.data.frame(as.matrix(lasso_coefs))  # Convert sparse matrix to dataframe
colnames(lasso_df) <- c("Coefficient")

# Display coefficients in a table
kable(round(lasso_df, 3), caption = "Coefficient Matrix for Lasso-Selected Model") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F) %>%
  column_spec(1:2, width = "4cm")
Coefficient Matrix for Lasso-Selected Model

| Term | Coefficient |
|---|---|
| (Intercept) | 20641.436 |
| housing_median_age | 1128.488 |
| total_rooms | -5.847 |
| total_bedrooms | 75.154 |
| population | -37.417 |
| households | 73.640 |
| median_income | 3.922 |
| ocean_proximityINLAND | -66688.343 |
| ocean_proximityISLAND | 169482.772 |
| ocean_proximityNEAR BAY | -3716.088 |
| ocean_proximityNEAR OCEAN | 13553.028 |
| GeoCluster2 | 22153.547 |
| GeoCluster3 | 12754.400 |
| GeoCluster4 | -13715.931 |

5.4.1 No variables removed by LASSO

The Lasso regression did not remove any variables, which indicates that all the predictors in the model have some level of significance in explaining the variance in house prices. This outcome suggests that despite the multicollinearity observed earlier, each predictor contributes meaningfully to the model. However, while Lasso didn’t eliminate variables, it did shrink their coefficients, which has important implications for the model’s interpretation and performance.

By shrinking the magnitude of the coefficients, Lasso helps mitigate the overfitting that can arise when certain predictors have an overly strong influence on the model. In the full model, some predictors had relatively large coefficients, which could result in an unstable model, particularly when new data are introduced. The reduced coefficient values in the Lasso model indicate that while these predictors still contribute to house price predictions, their individual impact is more tempered, which helps improve the generalizability of the model to unseen data.

The fact that Lasso did not remove any predictors also implies that multicollinearity is still present in the model. However, by shrinking the coefficients, Lasso reduces the sensitivity of the model to multicollinearity. While this doesn’t completely resolve the issue of multicollinearity, it makes the model less prone to the erratic changes in coefficients that can occur when collinear predictors are included.

In practice, this means that while the Lasso model is likely to generalize better to new data than the full model, it still contains all of the original predictors, which could complicate interpretation. Each predictor still plays a role in explaining house prices, but their contributions are reduced, resulting in a more stable, but slightly less interpretable, model. The implications of this are twofold: Lasso provides a safeguard against overfitting while still allowing for the full complexity of the data to be captured. However, if the goal is to simplify the model further, other strategies—such as variable interactions or further domain-driven selection—may be needed.

Overall, the Lasso model strikes a balance between retaining predictor significance and ensuring a more robust, generalizable model. By reducing the influence of individual variables without eliminating them, Lasso maintains the explanatory power of the model while making it more resistant to overfitting.

5.4.2 LASSO Coefficient Interpretation

The coefficients above summarize the relationship between predictor variables and median house value after Lasso regularization:

  • Housing Median Age: Each additional year of housing median age increases house value by $1,128, slightly reduced from the full model. This implies that the age of the house remains a strong positive factor in house valuation.

  • Total Rooms: For each additional room, house value decreases by $5.85, compared to $6.72 in the full model. This reduction suggests that Lasso minimized the influence of total rooms but did not remove it entirely.

  • Total Bedrooms: Each additional bedroom increases house value by $75.15, slightly smaller than in the full model ($79.90), indicating that bedrooms are still valued more than overall room count but with reduced impact.

  • Population: For each additional person in the population, house value decreases by $37.41, nearly identical to the full model. More populated areas still tend to have lower house values.

  • Households: Each additional household increases house value by $73.64, again similar to the full model. This continues to suggest that higher demand for housing raises property prices.

  • Median Income: For every 1 dollar increase in median income, house value increases by $3.92, which is consistent with the full model. Higher-income areas still correspond to higher house prices.

  • Ocean Proximity:

    • INLAND: Inland properties are associated with a $66,688 decrease in value compared to properties within <1 hour from the ocean.
    • ISLAND: Island properties are associated with a $169,482 increase in value compared to properties within <1 hour from the ocean.
    • NEAR BAY: Properties near the bay are associated with a $3,716 decrease in value compared to those <1 hour from the ocean.
    • NEAR OCEAN: Properties near the ocean are associated with a $13,553 increase in value compared to those <1 hour from the ocean.
  • GeoCluster:

    • GeoCluster2: Properties in this cluster are associated with a $22,153 increase in value compared to GeoCluster1.
    • GeoCluster3: Properties in this cluster see a $12,754 increase in value compared to GeoCluster1.
    • GeoCluster4: Properties in this cluster have a $13,715 decrease in value compared to GeoCluster1.

5.4.3 LASSO Residual Analysis

We now assess the Lasso model by checking for any violations of regression assumptions through residual plots:


# Predict values using the Lasso model
yhat_lasso <- predict(final_lasso, newx = X)

# Calculate residuals
residuals_lasso <- Y - yhat_lasso

# Plot residuals against predicted values
par(mfrow = c(2, 2))

# Residuals vs Fitted (Predicted) values
plot(yhat_lasso, residuals_lasso, 
     main = "Residuals vs Fitted", 
     xlab = "Fitted values (Predicted)", 
     ylab = "Residuals")
abline(h = 0, col = "red")

# Q-Q plot for normality
qqnorm(residuals_lasso, main = "Normal Q-Q Plot")
qqline(residuals_lasso, col = "red")

# Scale-location plot (for homoscedasticity)
sqrt_residuals <- sqrt(abs(residuals_lasso))
plot(yhat_lasso, sqrt_residuals, 
     main = "Scale-Location", 
     xlab = "Fitted values (Predicted)", 
     ylab = "Sqrt(|Residuals|)")
abline(h = 0, col = "red")

# Residuals vs Leverage plot
# For this plot, we need the hat matrix (leverage) values
hat_values <- hatvalues(lm(Y ~ X))  # Approximate hat values
plot(hat_values, residuals_lasso, 
     main = "Residuals vs Leverage", 
     xlab = "Leverage", 
     ylab = "Residuals")
abline(h = 0, col = "red")


par(mfrow = c(1, 1))  # Reset to default plotting window

  • Residuals vs Fitted: A slight curvature suggests some non-linearity, especially for higher fitted values. However, given the large sample size, the model still captures the overall trend effectively, and minor deviations are not a major concern.

  • Normal Q-Q Plot: The tails deviate from normality, reflecting a few outliers. With a large dataset, small deviations from normality are expected and don’t severely impact the model’s overall validity.

  • Scale-Location Plot: There is some heteroscedasticity (increasing spread of residuals), but the large sample size mitigates its impact on inference, ensuring parameter estimates remain reliable.

  • Residuals vs Leverage Plot: A few high-leverage points are observed, but they have minimal impact due to the large number of observations, reducing the overall influence of these outliers.

In summary, while minor assumption violations exist, the large sample size helps mitigate their effects, allowing the model to perform well overall.