Abstract

This statistical report delves into the intricate dynamics of California house prices, utilizing a modified version of the California Housing data set derived from the 1990 census data. Employing methodologies such as Box-Cox transformations and bootstrapping with cases and residuals, this analysis aims to uncover nuanced patterns and relationships within this housing market.

Key predictors that were analyzed spanned geographical location, housing characteristics, demographic details, and economic indicators of block groups and their inhabitants.

The study begins with exploratory data analysis, employing Box-Cox transformations to address issues of non-normality and heteroscedasticity in the data set. Subsequently, a robust bootstrapping approach is applied across cases then residuals, offering a comprehensive assessment given our inability to make assumptions about the underlying distribution. Through these techniques, we seek to provide a more accurate understanding of the factors influencing house prices across California’s diverse census block groups.

Section 1 Introduction


1.1 Motivation

The motivation behind undertaking this comprehensive data analysis of California house prices lies in the quest to unravel the intricate dynamics influencing one of the most pivotal aspects of the state’s economy – its real estate market. As housing affordability and market trends continue to be of paramount importance for residents, policymakers, and real estate professionals alike, there exists a compelling need to delve into the underlying factors that shape housing prices across diverse census block groups.

This analysis seeks to provide actionable insights for informed decision-making in various spheres. By applying advanced statistical techniques such as Box-Cox transformations and bootstrapping, we aim to not only mitigate data challenges but also enhance the robustness and accuracy of our findings. The significance of this study is amplified by its foundation in the 1990 California census data, offering a historical perspective that enriches our understanding of how socio-economic and spatial factors have shaped the housing landscape over time.

1.2 Data set Description

The data set was found on Kaggle. It includes information gathered from various block groups in California during the 1990 Census. The U.S. Census Bureau uses the block group as the smallest geographical unit, typically consisting of 600 to 3000 people. Each observation in the data set represents one block group and comprises an average of 1425.5 individuals in a geographically compact area.

This data set is a modified version of the data originally collected by the US Census Bureau: Certain values were removed from the original data set; In addition, a categorical dependent variable, Ocean Proximity, was created. As a result, the data set contains 20640 observations of 9 dependent variables and one independent variable: median house value.

The link to the raw CSV file in posted on GitHub: https://raw.githubusercontent.com/JZhong01/STA321/main/Topic%203/ca-housing-price.csv.

  • Longitude(X1): How far west block group is located from prime meridian
  • Latitude(X2): How far north block group is located from equator
  • HousingMedianAge(X3): Median age of house within block group
  • TotalRooms(X4): Total number of rooms in block
  • TotalBedrooms(X5): Total number of bedrooms in block
  • Population(X6): Total number of residents in block group
  • Households(X7): Total number of households, defined as a group of people residing within a home unit
  • MedianIncome(X8): Median income, in thousands of US Dollars, for households within a block group
  • MedianHouseValue(Y): Median house value, in US Dollars, in a block
  • OceanProximity(X9): Location of the house to an ocean or sea, displayed as a categorical variable

1.3 Research Question

The primary objective of this analysis is to identify the primary factors that influence house prices amongst California census block groups.

Practical questions that this research aims to explore are as follows:

  • What socioeconomic factors exhibit the largest correlation with median house price?
  • What transformations may need to be conducted on the data, or do no transformations need to be done?

This analysis will seek to explore 2 hypotheses:

  • Hypothesis 1: A reduced model will perform better than the full model.

  • Hypothesis 2: Distance to the ocean is negatively correlated with median house value.

2 Methods


2.1 Data Characterisitics

These data were originally published in the 1997 paper titled Sparse Spatial Autoregressions by Pace et al. Each observation represents one census block group from the 1990 US Census conducted in California.

There is only one response variable - the median house value of the block group. There are 9 predictor variables consisting of 8 numerical variables and one nominal variable. These numerical variables describe geographic location (longitude, latitude), house characteristics (median house age, total rooms, total bedrooms), resident characteristics (population, households), and socioeconomic characteristics (median household income). The categorical variable very roughly describes 5 categories of relation to the ocean (<1 hour away, inland, near ocean, near bay, or island).


Summary of Data Set’s Statistics
longitude Min. :-124.3 1st Qu.:-121.8 Median :-118.5 Mean :-119.6 3rd Qu.:-118.0 Max. :-114.3 NA
latitude Min. :32.54 1st Qu.:33.93 Median :34.26 Mean :35.63 3rd Qu.:37.71 Max. :41.95 NA
housing_median_age Min. : 1.00 1st Qu.:18.00 Median :29.00 Mean :28.64 3rd Qu.:37.00 Max. :52.00 NA
total_rooms Min. : 2 1st Qu.: 1448 Median : 2127 Mean : 2636 3rd Qu.: 3148 Max. :39320 NA
total_bedrooms Min. : 1.0 1st Qu.: 296.0 Median : 435.0 Mean : 537.9 3rd Qu.: 647.0 Max. :6445.0 NA’s :207
population Min. : 3 1st Qu.: 787 Median : 1166 Mean : 1425 3rd Qu.: 1725 Max. :35682 NA
households Min. : 1.0 1st Qu.: 280.0 Median : 409.0 Mean : 499.5 3rd Qu.: 605.0 Max. :6082.0 NA
median_income Min. : 0.4999 1st Qu.: 2.5634 Median : 3.5348 Mean : 3.8707 3rd Qu.: 4.7432 Max. :15.0001 NA
median_house_value Min. : 14999 1st Qu.:119600 Median :179700 Mean :206856 3rd Qu.:264725 Max. :500001 NA
ocean_proximity Length:20640 Class :character Mode :character NA NA NA NA

A summary of the statistics’ distributions are made in the table above. Our 9 numerical variables have descriptive statistics describing minimum value, 1st quartile/25th percentile, median/50th percentile, mean, 3rd quartile/75th percentile, and maximum values. For our categorical variable it shows count. For all variables the last column counts the total number of NAs or none values. Our data set no missing values except 207 missing values for total bedrooms. This issue will be resolved by imputing the missing values, which is the process of replacing missing values with a statistical estimate based off other observations. In this case, we are imputing the value with 435, the median number of total bedrooms per block group.

After imputing the data with the median, the new total number of NAs in the Total Bedrooms variable is 0.

# proof of imputation
sum(is.na(cahouses$total_bedrooms))
FALSE [1] 0


2.2 Exploratory Data Analysis

Exploratory data analysis is conducted to assess the distributions of key variables.

2.2.1 Geographic Location

There are 3 variables that describe geographic locations of houses: longitude, latitude, and ocean_proximity. A plot was first created that plots house locations by longitude and latitude.

The longitude and latitude values plotted depict precise values that describe the data in terms of geographical location.

Next, 5 histograms were plotted based off what value each observation had for OceanProximity.

As depicted above, despite there being 5 unique values for OceanProximity, <1H from ocean, Near ocean, and Near bay all have distributions that closely resemble one another. This is a concern that may be addressed later if multicollinearity is found between longitude, latitude, and ocean proximity.

2.2.2 Median Income

There is a discrepancy in units for median income of households and median house value. The units for median income are in tens of thousands of US Dollars whereas the units for median house value are in US Dollars. As such, unit conversion for median income was performed to convert units to US Dollars to be more consistent.

In the distribution for median income, each bar represents an increment of 10000. So the first bar represents that roughly 100 block groups have a median income of 0-9999 US Dollars. The median income has a right skew as there are roughly less than 500 block groups that have a median income of 100k US Dollars or more. However, a decision was made to not discretize median income as the resulting binning will reduce statistical power and lead to information loss. Furthermore, 50 observations of 150001 demonstrates that most likely the median income was already binned for any amount found over $150000.

#proof income was pre-discretized
tail(sort(cahouses$median_income), 50)
FALSE  [1] 150000 150001 150001 150001 150001 150001 150001 150001 150001 150001
FALSE [11] 150001 150001 150001 150001 150001 150001 150001 150001 150001 150001
FALSE [21] 150001 150001 150001 150001 150001 150001 150001 150001 150001 150001
FALSE [31] 150001 150001 150001 150001 150001 150001 150001 150001 150001 150001
FALSE [41] 150001 150001 150001 150001 150001 150001 150001 150001 150001 150001

2.2.3 House Characteristics

The distributions for Housing Median Age, Total Rooms, and Total Bedrooms is plotted below.

The distribution for the median house age looks relatively normal, except the last bar for the histogram, which is unusually large. Upon closer inspection, this is a result of the data being discretized already. The last 500+ median age observations are 52 years old; this is likely a result of all houses aged 50+ being recorded as 52 years old. No transformation is likely required as normality is roughly satisfied.

The plots for total rooms and total bedrooms both exhibit heavy right skews. This is unsurprising as apartment complexes tend to have a smaller number of rooms; since the census block group factors in population into its calculation of what geographic area encompasses a block group, dense populations will create a higher number of low room count block groups. As a result, a transformation is likely to be required to correct these skews.

2.2.4 Inhabitant Characteristics

The distributions for population and households are plotted below.

Similar to the logic behind total rooms and total bedrooms, population dense areas contribute to smaller geographic areas such as urban and suburban areas; this leads to a larger frequency of block groups with a smaller population and household count. As a result, a transformation is likely to be required to correct these skews.

2.2.5 VIFs and Pairwise Scatterplots

VIFs or Variance Inflation Factors quantify the inflation of coefficients due to collinearity. Higher multicollinearity indicate stronger multicollinearity, which can lead to less reliable models. A VIF of 1 indicates zero collinearity, a VIF of 4 indicates potential multicollinearity, and a VIF of 10 indicates a strong presence of multicollinearity.

FALSE                         GVIF Df GVIF^(1/(2*Df))
FALSE longitude          18.028444  1        4.245992
FALSE latitude           19.925764  1        4.463828
FALSE housing_median_age  1.321927  1        1.149751
FALSE total_rooms        12.349114  1        3.514131
FALSE total_bedrooms     27.040073  1        5.200007
FALSE population          6.342122  1        2.518357
FALSE households         28.315383  1        5.321220
FALSE median_income       1.740468  1        1.319268
FALSE ocean_proximity     4.090317  4        1.192531

The GVIF^(1/(2*Df)) value helps measure multicollinearity. It quantifies how much variance increases if predictor variables are correlated. The adjusted GVIF variables are adjusted for degrees of freedom. An adjusted GVIF value of below 5 indicates little to no multicollinearity. The variables Total Bedrooms and Households are slightly above an adjusted GVIF value of 5, so that may be of concern going forward.

The pair-wise is conducted on all variables except the response variable Median House Value and the categorical variable Ocean Proximity. The response variable should not be factored into the pair-wise comparisons as its correlation is irrelevant to reducing the model. The categorical variable is omitted because it pair-wise scatter plots are generally used for only numerical variables and displaying the relationship between them.

The pair-wise scatter plot of these 8 numerical variables shows 2 groups of variables that are heavily correlated with one another: Longitude with Latitude; and Total Rooms, Total Bedrooms, Population, and Households all with each other. Within each group, the correlation was found to be r > 0.80 which represents a large correlation. This demonstrates that there is some collinearity present between pairs of predictors as well.

As a result of these VIFs and pair-wise scatter plots, it is recommended that potential multicollinearity should be examined further and see if removing any predictors is useful. These reduced models will then be made candidate models.

3 Regression Modeling

After doing our exploratory data analysis and preparing the data, next steps are to model the regression with an appropriate linear model that not only satisfies the relationship, but also satisfies any underlying assumptions about the data.


3.1 Model Refinement

From Section 2.2.5 we discovered potential multicollinearity in the full model; and from Sections 2.2.2, 2.2.3, and 2.2.4, we see that Median Income, Total Rooms, Total Bedrooms, Population, and Households are all right-skewed. While it’s important to address both concerns, addressing multicollinearity first helps so that we work with reduced models where multicollinearity is not present.

3.2 Full Model Transformation

The full model is the linear model that includes all predictors.

Coefficient Matrix for Full Linear Model
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2233091.969942 87478.9016089 -25.527206 0.0000000
longitude -26430.444664 1013.8905512 -26.068341 0.0000000
latitude -25173.279834 999.8252471 -25.177680 0.0000000
housing_median_age 1057.816131 43.7058419 24.203083 0.0000000
total_rooms -4.730582 0.7706335 -6.138563 0.0000000
total_bedrooms 71.344944 5.9318763 12.027382 0.0000000
population -39.287456 1.0639024 -36.927687 0.0000000
households 77.804415 6.6585750 11.684845 0.0000000
median_income 3.876045 0.0332222 116.670336 0.0000000
ocean_proximityINLAND -39766.398744 1736.2502441 -22.903610 0.0000000
ocean_proximityISLAND 156065.719822 30772.4819840 5.071600 0.0000004
ocean_proximityNEAR BAY -3697.401661 1906.0410182 -1.939833 0.0524136
ocean_proximityNEAR OCEAN 4758.753612 1562.7233999 3.045167 0.0023284


Residual plots are also conducted to check for assumptions.

The residual plots show minor violations in linearity, constant variance, and normality. As mentioned in Section 2.2.5, the full model has a heavy presence of multicollinearity, making it a poor model to predict median house value.

3.2.1 Addressing Multicollinearity

In order to address multicollinearity, it is good practice to remove one of the correlated variables. In this case, the highest adjusted GVIFs are in Total Bedrooms and Households, so those will be iteratively removed to assess for reduced model quality.

FALSE                         GVIF Df GVIF^(1/(2*Df))
FALSE longitude          17.877996  1        4.228238
FALSE latitude           19.838932  1        4.454092
FALSE housing_median_age  1.319408  1        1.148655
FALSE total_rooms         9.697980  1        3.114158
FALSE population          6.026165  1        2.454825
FALSE households         11.773368  1        3.431234
FALSE median_income       1.567345  1        1.251936
FALSE ocean_proximity     4.066419  4        1.191658
FALSE 
FALSE Call:
FALSE lm(formula = median_house_value ~ . - total_bedrooms, data = cahouses)
FALSE 
FALSE Residuals:
FALSE     Min      1Q  Median      3Q     Max 
FALSE -535915  -43323  -10823   29169  837443 
FALSE 
FALSE Coefficients:
FALSE                             Estimate Std. Error t value Pr(>|t|)    
FALSE (Intercept)               -2.122e+06  8.729e+04 -24.308  < 2e-16 ***
FALSE longitude                 -2.532e+04  1.013e+03 -24.988  < 2e-16 ***
FALSE latitude                  -2.438e+04  1.001e+03 -24.352  < 2e-16 ***
FALSE housing_median_age         1.035e+03  4.382e+01  23.618  < 2e-16 ***
FALSE total_rooms               -4.360e-01  6.853e-01  -0.636   0.5246    
FALSE population                -4.214e+01  1.041e+00 -40.497  < 2e-16 ***
FALSE households                 1.390e+02  4.309e+00  32.265  < 2e-16 ***
FALSE median_income              3.750e+00  3.164e-02 118.536  < 2e-16 ***
FALSE ocean_proximityINLAND     -4.107e+04  1.739e+03 -23.616  < 2e-16 ***
FALSE ocean_proximityISLAND      1.628e+05  3.087e+04   5.274 1.35e-07 ***
FALSE ocean_proximityNEAR BAY   -3.082e+03  1.912e+03  -1.612   0.1070    
FALSE ocean_proximityNEAR OCEAN  5.043e+03  1.568e+03   3.216   0.0013 ** 
FALSE ---
FALSE Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
FALSE 
FALSE Residual standard error: 68970 on 20628 degrees of freedom
FALSE Multiple R-squared:  0.643,   Adjusted R-squared:  0.6428 
FALSE F-statistic:  3377 on 11 and 20628 DF,  p-value: < 2.2e-16
FALSE                         GVIF Df GVIF^(1/(2*Df))
FALSE longitude          17.735744  1        4.211383
FALSE latitude           19.706970  1        4.439253
FALSE housing_median_age  1.321463  1        1.149549
FALSE total_rooms        12.316689  1        3.509514
FALSE total_bedrooms     11.243102  1        3.353074
FALSE population          4.880018  1        2.209076
FALSE median_income       1.740394  1        1.319240
FALSE ocean_proximity     4.085804  4        1.192366
FALSE 
FALSE Call:
FALSE lm(formula = median_house_value ~ . - households, data = cahouses)
FALSE 
FALSE Residuals:
FALSE     Min      1Q  Median      3Q     Max 
FALSE -553232  -42896  -10536   29110  685840 
FALSE 
FALSE Coefficients:
FALSE                             Estimate Std. Error t value Pr(>|t|)    
FALSE (Intercept)               -2.369e+06  8.698e+04 -27.241  < 2e-16 ***
FALSE longitude                 -2.794e+04  1.009e+03 -27.693  < 2e-16 ***
FALSE latitude                  -2.640e+04  9.976e+02 -26.461  < 2e-16 ***
FALSE housing_median_age         1.067e+03  4.384e+01  24.346  < 2e-16 ***
FALSE total_rooms               -4.269e+00  7.722e-01  -5.529 3.26e-08 ***
FALSE total_bedrooms             1.243e+02  3.838e+00  32.397  < 2e-16 ***
FALSE population                -3.332e+01  9.363e-01 -35.585  < 2e-16 ***
FALSE median_income              3.879e+00  3.333e-02 116.367  < 2e-16 ***
FALSE ocean_proximityINLAND     -4.013e+04  1.742e+03 -23.040  < 2e-16 ***
FALSE ocean_proximityISLAND      1.487e+05  3.087e+04   4.817 1.47e-06 ***
FALSE ocean_proximityNEAR BAY   -3.750e+03  1.912e+03  -1.961  0.04988 *  
FALSE ocean_proximityNEAR OCEAN  4.395e+03  1.568e+03   2.804  0.00506 ** 
FALSE ---
FALSE Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
FALSE 
FALSE Residual standard error: 68960 on 20628 degrees of freedom
FALSE Multiple R-squared:  0.6431,  Adjusted R-squared:  0.6429 
FALSE F-statistic:  3379 on 11 and 20628 DF,  p-value: < 2.2e-16
FALSE                         GVIF Df GVIF^(1/(2*Df))
FALSE longitude          17.734162  1        4.211195
FALSE latitude           19.697471  1        4.438183
FALSE housing_median_age  1.318726  1        1.148358
FALSE total_rooms         4.817136  1        2.194797
FALSE population          4.564656  1        2.136506
FALSE median_income       1.337037  1        1.156303
FALSE ocean_proximity     4.013501  4        1.189708
FALSE 
FALSE Call:
FALSE lm(formula = median_house_value ~ . - total_bedrooms - households, 
FALSE     data = cahouses)
FALSE 
FALSE Residuals:
FALSE     Min      1Q  Median      3Q     Max 
FALSE -500267  -45195  -11777   30104  504398 
FALSE 
FALSE Coefficients:
FALSE                             Estimate Std. Error t value  Pr(>|t|)    
FALSE (Intercept)               -2.353e+06  8.916e+04 -26.390   < 2e-16 ***
FALSE longitude                 -2.825e+04  1.034e+03 -27.314   < 2e-16 ***
FALSE latitude                  -2.711e+04  1.022e+03 -26.514   < 2e-16 ***
FALSE housing_median_age         1.003e+03  4.490e+01  22.335   < 2e-16 ***
FALSE total_rooms                1.525e+01  4.950e-01  30.808   < 2e-16 ***
FALSE population                -2.561e+01  9.283e-01 -27.586   < 2e-16 ***
FALSE median_income              3.359e+00  2.995e-02 112.156   < 2e-16 ***
FALSE ocean_proximityINLAND     -4.677e+04  1.773e+03 -26.377   < 2e-16 ***
FALSE ocean_proximityISLAND      1.527e+05  3.164e+04   4.826 0.0000014 ***
FALSE ocean_proximityNEAR BAY   -1.346e+03  1.959e+03  -0.687   0.49200    
FALSE ocean_proximityNEAR OCEAN  4.385e+03  1.607e+03   2.729   0.00636 ** 
FALSE ---
FALSE Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
FALSE 
FALSE Residual standard error: 70690 on 20629 degrees of freedom
FALSE Multiple R-squared:  0.6249,  Adjusted R-squared:  0.6248 
FALSE F-statistic:  3437 on 10 and 20629 DF,  p-value: < 2.2e-16

While all 3 reduced models have no multicollinearity, we are also concerned with whether the reduced model has significant coefficients. By removing Total Bedrooms, it makes either Total Rooms or dummy variable Ocean Proximity - near bay not statistically significant. Therefore, the reduced model we’re going with is the reduced model where only the Households predictor variable is removed.

3.2.2 Stepwise Regression

Step-wise selection is run on the full model to determine which predictors that could be removed to improve the data.

#stepwise regression
step_model <- stepAIC(full_model, direction = "both", trace = FALSE)
summary(step_model)
FALSE 
FALSE Call:
FALSE lm(formula = median_house_value ~ longitude + latitude + housing_median_age + 
FALSE     total_rooms + total_bedrooms + population + households + 
FALSE     median_income + ocean_proximity, data = cahouses)
FALSE 
FALSE Residuals:
FALSE     Min      1Q  Median      3Q     Max 
FALSE -550613  -42739  -10602   28750  794919 
FALSE 
FALSE Coefficients:
FALSE                             Estimate Std. Error t value Pr(>|t|)    
FALSE (Intercept)               -2.233e+06  8.748e+04 -25.527  < 2e-16 ***
FALSE longitude                 -2.643e+04  1.014e+03 -26.068  < 2e-16 ***
FALSE latitude                  -2.517e+04  9.998e+02 -25.178  < 2e-16 ***
FALSE housing_median_age         1.058e+03  4.371e+01  24.203  < 2e-16 ***
FALSE total_rooms               -4.731e+00  7.706e-01  -6.139 8.48e-10 ***
FALSE total_bedrooms             7.134e+01  5.932e+00  12.027  < 2e-16 ***
FALSE population                -3.929e+01  1.064e+00 -36.928  < 2e-16 ***
FALSE households                 7.780e+01  6.659e+00  11.685  < 2e-16 ***
FALSE median_income              3.876e+00  3.322e-02 116.670  < 2e-16 ***
FALSE ocean_proximityINLAND     -3.977e+04  1.736e+03 -22.904  < 2e-16 ***
FALSE ocean_proximityISLAND      1.561e+05  3.077e+04   5.072 3.98e-07 ***
FALSE ocean_proximityNEAR BAY   -3.697e+03  1.906e+03  -1.940  0.05241 .  
FALSE ocean_proximityNEAR OCEAN  4.759e+03  1.563e+03   3.045  0.00233 ** 
FALSE ---
FALSE Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
FALSE 
FALSE Residual standard error: 68730 on 20627 degrees of freedom
FALSE Multiple R-squared:  0.6455,  Adjusted R-squared:  0.6452 
FALSE F-statistic:  3129 on 12 and 20627 DF,  p-value: < 2.2e-16

The stepwise model is identical to the full model. This means that after running a stepwise regression, the procedure did not find any predictors to remove or add to improve the model based on AICs. A decision was made to keep the reduced model without Households as a candidate model.

While the reduced model has improved model interpretability, this does have the unintended consequences that the predictions are less accurate. This is a trade-off that is taken as stability in the regression coefficients is preferred. However, if all else fails, a slight amount of multicollinearity is preferred to a far worse model.

3.2.3 Box-Cox Transformation

Box-Cox transformations were conducted on various reduced and log-transformed predictor variables.

The generated Box-Cox graphs all depict a lambda of roughly 0.18. A lambda of 0 is a special power transformation where the Box-Cox transformation degenerates to a log transformation. Therefore, this transformation will turn Median House Value to log(Median House Value).

3.3 Candidate Models

The following models are candidate models:

3.3.1 Box-Cox Transformation: Improved or Not?

The Box-Cox/Log transformation substantially improved the linearity and normality. Therefore all non-log transformed models are removed from the candidates.

3.3.2 Log Transformation of Predictors

Despite this, however, there are still violations in normality and constant variance, which poses a problem. One attempt at solving this was through log transformations of the right-skewed predictor variables in conjunction with the box-cox transformation.

Surprisingly, applying the log transformation to every predictor that exhibited a right skew (total rooms, total bedrooms, population, housing, median income), drastically improved the linearity and constant variance and greatly improved the normality. This full log transformed model still has minor violations in the residual assumptions, however, especially prevalent at the tails.

The reduced model with log transformed predictors did even better yet, fixing the normality of the residuals slightly at the tails.

3.3.3 Diagnostic Analyses of Box-Cox Models

Comparing the residual constant variance and normality with side-by-side plots we get the following:

3.3.4 Goodness-of-fit Measures

Goodness-of-fit measures help to assess how well the model fits the observed data. These measures offer a quantitative value for how closely predicted values model the actual observed values.

The Goodness of Fit measures used here to assess the remaining candidate models are the following:

  • SSE: The sum of squared errors. Computed by adding up (predicted - observed)^2 values.
  • \(R^2\): Coefficient of Determination or R-squared. Measure of the proportion of variation in the dependent variable as explained by the independent variables.
  • \(R^2_{adj}\): Adjusted R-squared, which is a modified R-squared statistic that accounts for number of predictors. It helps reduce the upward bias experienced by normal R-squared.
  • \(C_p\): Mallow’s \(C_p\), which is designed to factor in goodness of fit and model complexity by factoring in estimating variance in residuals (alongside SSE).
  • AIC: Akaike Information Criterion, a calculation that balances model fit with complexity.
  • SBC: Schwarz Bayesian information Criterion, which is similar to AIC in analyzing model fit and complexity, but calculating complexity by factoring in sample size.
  • PRESS: Predicted Residual Sum of Squares, which measures predictive performance based on the residual and leverage values of each observation in the model.

It’s worth noting that SSE, AIC, SBC, and PRESS depend on the magnitude of the response. \(R^2\), \(R^2_{adj}\), and \(C_p\) scale accordingly and are independent of response. This means that when the response is transformed in log transformations or box-cox, \(R^2\), \(R^2_{adj}\), and \(C_p\) are better at measuring goodness of fit.

Goodness-of-fit Measures of Candidate Models
SSE R.sq R.adj Cp AIC SBC PRESS
Full Model 9.744085e+13 0.65 0.65 13 459787.79 459890.94 9.782751e+13
Full Box-Cox Model 2.239840e+03 0.66 0.66 13 -45811.85 -45708.69 2.247550e+03
Reduced Box-Cox Model 2.253870e+03 0.66 0.66 12 -45684.99 -45589.77 2.259680e+03
Full Log Predictors Model 2.087270e+03 0.69 0.69 13 -47267.92 -47164.76 2.092090e+03

3.4 Final Model from Random Sampling

In selecting the optimal model for our analysis, we have decided to utilize the full model with a Box-Cox transformation, as opposed to the reduced model. This decision is informed by several considerations that strike a balance between statistical rigor and practical application, particularly in the context of a health study.

The full model, inclusive of “clinically important” variables as well as other predictors, has been chosen despite the potential for multicollinearity. This is because the adjusted Generalized Variance Inflation Factor (GVIF) values are only marginally above the threshold of 5, suggesting that multicollinearity may not substantially impair the interpretability or predictive power of our model. Additionally, the stepwise regression procedure, which includes both backward and forward selection based on the Akaike Information Criterion (stepAIC), has indicated that the full model is the most appropriate. The use of stepAIC ensures that each variable’s inclusion is justified based on its contribution to model fit while penalizing unnecessary complexity.

The Box-Cox transformation was applied and subsequently chosen due to its significant improvement of the model fit over the non-transformed version. This transformation addresses issues of non-normality in the residuals, which can bias parameter estimates and standard errors, leading to invalid inferences. By stabilizing variance and making the data more closely meet the assumptions of linear regression, the Box-Cox transformed model provides a more reliable foundation for drawing conclusions about the relationships between variables.

We have opted not to go with log transforming any of the predictor variables. While transforming the predictors increased the adjusted R-squared value by 0.03, the model was complicated and confusing to interpret. In our case, we prioritize the practical significance and interpretability of the model coefficients in their original units. Moreover, the marginal improvement in the residuals from the log transformation does not outweigh the benefits of maintaining a model that is straightforward to interpret for practitioners and stakeholders.

Thus, the full model with Box-Cox transformation is chosen for its balance of statistical and practical significance.

4 Bootstrapping

Bootstrapping is used to estimate the distribution of a sample statistic without making assumptions about the underlying population distribution. By repeatedly resampling with replacement from the data, it generates a new distribution which can be used to assess the variability and stability of the model estimates. This works because it operates on the assumption that this sample is a good representation of the population, so resampling generates new samples of the actual population. This technique helps us alleviate concerns about the influence of outliers, the normality of residuals, and the precision of confidence/prediction intervals.

4.1 Bootstrap Cases

Bootstrap cases is the process of resampling an entire observation from the original sample data. This preserves any correlation between variables and is a more holistic approach to preserving each observation.

To estimate the confidence intervals, we performed bootstrapping with B = 1000 replicates. The coefficients from these replicates are stored in a matrix ‘coef.mtrx’.

Histograms are then made of the distribution of bootstrap regression coefficients where one observation is one run of the bootstrap regression.

We then visualize the distribution of the bootstrap estimate for each coefficient using histograms and overaly two density curves: - The red density curve is the estimated regression coefficients and its p-values are reported in the output. - The blue density curve is the estimate based off the bootstrap sampling distribution.

The histograms show that the red and blue density curves are closely aligned, indicating consistency between the significance tests (p-values) and the bootstrap confidence intervals.

95% Confidence intervals are conducted on these bootstrap regression coefficients and are combined with the final model’s coefficient matrix.

Regression Coefficient Matrix
Estimate Std. Error t value Pr(>|t|) btc.ci.95
(Intercept) -2.1900 0.4194 -5.2217 0.0000 [ -3.1524 , -1.2794 ]
longitude -0.1604 0.0049 -32.9988 0.0000 [ -0.1719 , -0.1495 ]
latitude -0.1562 0.0048 -32.5882 0.0000 [ -0.1678 , -0.1451 ]
housing_median_age 0.0025 0.0002 11.7091 0.0000 [ 0.002 , 0.0029 ]
total_rooms -0.0000 0.0000 -2.2477 0.0246 [ 0 , 0 ]
total_bedrooms 0.0003 0.0000 9.5032 0.0000 [ 0.0002 , 0.0004 ]
population -0.0002 0.0000 -34.8716 0.0000 [ -0.0002 , -0.0001 ]
households 0.0004 0.0000 11.3658 0.0000 [ 0.0002 , 0.0005 ]
median_income 0.0000 0.0000 104.8550 0.0000 [ 0 , 0 ]
ocean_proximityINLAND -0.3099 0.0083 -37.2314 0.0000 [ -0.3299 , -0.2921 ]
ocean_proximityISLAND 0.6016 0.1475 4.0773 0.0000 [ 0.3006 , 0.8211 ]
ocean_proximityNEAR BAY -0.0374 0.0091 -4.0960 0.0000 [ -0.0556 , -0.0195 ]
ocean_proximityNEAR OCEAN -0.0319 0.0075 -4.2544 0.0000 [ -0.047 , -0.0158 ]

The summary table reveals that the significance tests based on p-values align with the bootstrap confidence intervals, confirming our robust results.

4.2 Bootstrap Residuals

Bootstrap residuals only resamples the residuals of a fitted model. This involves creating a fitted model to the data, calculating the residuals, then resampling with replacement from these residuals. However, this method operates on the assumptions that the model structure is accurate and that non-error components are nonrandom.

In this case, we are bootstrapping the residuals from the model that has all predictors and log(Median House Value).

A histogram is made of the residual bootstrap estimates for each regression coefficient.

In the coefficient histograms, we observe that the normal and LOESS density curves closely align, which indicates that the p-value-based inference and residual bootstrap inference yielded consistent results.

Regression Coefficient Matrix with 95% Residual Bootstrap CI
Estimate Std. Error t value Pr(>|t|) btr.ci.95
(Intercept) -2.1900 0.4194 -5.2217 0.0000 [ -2.991 , -1.421 ]
longitude -0.1604 0.0049 -32.9988 0.0000 [ -0.17 , -0.1515 ]
latitude -0.1562 0.0048 -32.5882 0.0000 [ -0.1661 , -0.1469 ]
housing_median_age 0.0025 0.0002 11.7091 0.0000 [ 0.002 , 0.0029 ]
total_rooms -0.0000 0.0000 -2.2477 0.0246 [ 0 , 0 ]
total_bedrooms 0.0003 0.0000 9.5032 0.0000 [ 0.0002 , 0.0003 ]
population -0.0002 0.0000 -34.8716 0.0000 [ -0.0002 , -0.0002 ]
households 0.0004 0.0000 11.3658 0.0000 [ 0.0003 , 0.0004 ]
median_income 0.0000 0.0000 104.8550 0.0000 [ 0 , 0 ]
ocean_proximityINLAND -0.3099 0.0083 -37.2314 0.0000 [ -0.3266 , -0.293 ]
ocean_proximityISLAND 0.6016 0.1475 4.0773 0.0000 [ 0.3251 , 0.885 ]
ocean_proximityNEAR BAY -0.0374 0.0091 -4.0960 0.0000 [ -0.0559 , -0.0195 ]
ocean_proximityNEAR OCEAN -0.0319 0.0075 -4.2544 0.0000 [ -0.0464 , -0.0182 ]

The resulting intervals align with the significance tests based on p-values, suggesting the estimated coefficients approximate normal distributions well, likely due to a sufficiently large sample size.

5 Results

5.1 Combining Inferential Statistics

We synthesize inferential statistics into a single table for comparison of both bootstrap methods:

Final Combined Inferential Statistics: p-values and Bootstrap CIs
Estimate Std. Error Pr(>|t|) btc.ci.95 btr.ci.95
(Intercept) -2.1900 0.4194 0.0000 [ -3.1524 , -1.2794 ] [ -2.991 , -1.421 ]
longitude -0.1604 0.0049 0.0000 [ -0.1719 , -0.1495 ] [ -0.17 , -0.1515 ]
latitude -0.1562 0.0048 0.0000 [ -0.1678 , -0.1451 ] [ -0.1661 , -0.1469 ]
housing_median_age 0.0025 0.0002 0.0000 [ 0.002 , 0.0029 ] [ 0.002 , 0.0029 ]
total_rooms -0.0000 0.0000 0.0246 [ 0 , 0 ] [ 0 , 0 ]
total_bedrooms 0.0003 0.0000 0.0000 [ 0.0002 , 0.0004 ] [ 0.0002 , 0.0003 ]
population -0.0002 0.0000 0.0000 [ -0.0002 , -0.0001 ] [ -0.0002 , -0.0002 ]
households 0.0004 0.0000 0.0000 [ 0.0002 , 0.0005 ] [ 0.0003 , 0.0004 ]
median_income 0.0000 0.0000 0.0000 [ 0 , 0 ] [ 0 , 0 ]
ocean_proximityINLAND -0.3099 0.0083 0.0000 [ -0.3299 , -0.2921 ] [ -0.3266 , -0.293 ]
ocean_proximityISLAND 0.6016 0.1475 0.0000 [ 0.3006 , 0.8211 ] [ 0.3251 , 0.885 ]
ocean_proximityNEAR BAY -0.0374 0.0091 0.0000 [ -0.0556 , -0.0195 ] [ -0.0559 , -0.0195 ]
ocean_proximityNEAR OCEAN -0.0319 0.0075 0.0000 [ -0.047 , -0.0158 ] [ -0.0464 , -0.0182 ]

This table demonstrates that all 3 methods came up with significant individual predictor variables. As a result of the findings for bootstrap residuals, bootstrap cases, and traditional inferential methods yielding significant predictors, it validates the model’s estimates. This demonstrates that the model is stable, implies that the predictive relationships are truly present in the data, and concludes that the findings are robust.

5.2 Interpretation in Context

Inferential Statistics of Final Model
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.1900420 0.4194125 -5.221689 0.0000002
longitude -0.1604085 0.0048610 -32.998810 0.0000000
latitude -0.1562148 0.0047936 -32.588174 0.0000000
housing_median_age 0.0024536 0.0002095 11.709063 0.0000000
total_rooms -0.0000083 0.0000037 -2.247698 0.0246060
total_bedrooms 0.0002703 0.0000284 9.503196 0.0000000
population -0.0001779 0.0000051 -34.871571 0.0000000
households 0.0003628 0.0000319 11.365763 0.0000000
median_income 0.0000167 0.0000002 104.855008 0.0000000
ocean_proximityINLAND -0.3099273 0.0083244 -37.231402 0.0000000
ocean_proximityISLAND 0.6015525 0.1475369 4.077303 0.0000457
ocean_proximityNEAR BAY -0.0374307 0.0091384 -4.095974 0.0000422
ocean_proximityNEAR OCEAN -0.0318756 0.0074924 -4.254393 0.0000211

The final model has 8 numerical predictors and 1 categorical variable that is split into 4 dummy variables. As a result, the model conforms to the following format:

\(Log(MedianHouseValue) = \beta_0 + \beta_1 * Longitude + \beta_2 * Latitude + \beta_3 * HousingMedianAge + \beta_4 * TotalRooms + \beta_5 * TotalBedrooms + \beta_6 * Population + \beta_7 * Households + \beta_8 * MedianIncome + \beta_9 * Ocean Proximity_{Inland} + \beta_10 * Ocean Proximity_{Island} + \beta_11 * Ocean Proximity_{Near Bay} + \beta_12 * Ocean Proximity_{Near Ocean} + \epsilon\)

where each beta represents a coefficient in the linear model and epsilon represents the residuals or error value.

The final model is as follows:

Log(Median House Value) = -2.19 - 0.16 * Longitude - 0.156 * Latitude + 0.00245 * Housing Median Age - \(8.3 * 10^{-6}\) * Total Rooms + \(2.7 * 10^{-4}\) * Total Bedrooms - \(1.78 * 10^{-4}\) * Population + \(3.63 * 10^{-4}\) * Households + \(1.67 * 10^{-5}\) * Median Income - 0.310 * \(Ocean Proximity_{Inland}\) + 0.602 * \(Ocean Proximity_{Island}\) - 0.0374 * \(Ocean Proximity_{Near Bay}\) - 0.0319 * \(Ocean Proximity_{Near Ocean}\).

In context, this means that if every other predictor variable was held constant, a one degree change in Longitude would result in Log(Median House Value) decreasing by 0.16. This logic applies to every other regression coefficient.

One thing to note, however, is that the response variable is Log(Median House Value). So if you wanted to interpret this value in terms of US Dollars, you’d take Euler’s number to the power of the response to convert it from Log(US Dollars) to US Dollars.

6 Discussion

6.1 Main Findings

The main findings are that the original full model was rife with violations in the assumptions. In order to alleviate these concerns, many transformations were attempted i.e. reducing the model, box-cox transformations, log transformations of predictor variables, etc. In the end a decision was made to balance statistical significance with practical significance, which resulted in going with a simplified version of the Box-Cox transformation where the response variable was the only variable transformed.

As a result, both bootstrap methods and the traditional inferential statistics found that all predictor variables were statistically significant at significance level \(\alpha\) = 0.05. Using bootstrap confidence intervals, we further found that these findings are robust in spite of minor violations of the assumptions. This means that we can comfortably recommend using this model to predict median house value in California homes within the context of US Census Bureau block groups.

6.2 Drawbacks and Future Improvements

This decision to forego statistical significance to enhance model interpretability comes with drawbacks.

  • Less Predictive power: Log-transforming the predictor variables resulted in a higher R^2 value, which indicates better predictive power. By purposefully choosing a model with lower R^2, we have elected to have a model that isn’t as good at predicting.

  • Overlooking interactions: By choosing to simplify the model, we forego any analysis of potential interactions between variables.

In addition, collinearity and multicollinearity in the model was not adequately addressed. Many of the variables such as Total Rooms and Total Bedrooms could linearly predict each other relatively accurate. This poses a problem by inflating standard errors and could lead to unreliable estimates for regression models. Some of the regression coefficients in the model were relatively low in magnitude, indicating that certain variables had less of an impact on the final predicting of the response; these variables could be reconsidered on whether or not they can be foregone.

6.3 Applications of Final Model

This model is useful in predicting housing prices in California based of 1990 US Census data. This model can be used comparatively with future census results in California to compare in a time series on the impacts of certain predictors to see if certain predictors become more or less important towards predicting housing prices.

It’s recommended that this model be used with caution as there are still minor violations in assumptions. While, these have been addressed by bootstrap methods and transformations, this could cause the model to be inaccurate at the higher and lower extremes. In addition, the underlying data was discretized before model analysis was run on it; this can also impact any results the model will provide.

References

Wang, H.(2018). housing.csv[Data set]. https://www.kaggle.com/code/harrywang/housing-price-prediction/input?select=housing.csv

O’Reilly Media (2017). California Housing. GitHub. https://github.com/ageron/handson-ml/blob/master/datasets/housing/housing.csv

GPS coordinates, latitude and longitude with interactive maps. GPS coordinates, latitude and longitude with interactive Maps. (2023). https://www.gps-coordinates.net/

Frees, E. W. (2009). Regression Modeling with Actuarial and Financial Applications, 465–489. https://doi.org/10.1017/cbo9780511814372

Raj, P. (2023, July 1). Residual vs fitted graph in R. Stack Overflow. https://stackoverflow.com/questions/76605232/residual-vs-fitted-graph-in-r