Abstract
This statistical report delves into the intricate dynamics of
California house prices, utilizing a modified version of the California
Housing data set derived from the 1990 census data. Employing
methodologies such as Box-Cox transformations and bootstrapping with
cases and residuals, this analysis aims to uncover nuanced patterns and
relationships within this housing market.
Key predictors that were analyzed spanned geographical location,
housing characteristics, demographic details, and economic indicators of
block groups and their inhabitants.
The study begins with exploratory data analysis, employing Box-Cox
transformations to address issues of non-normality and
heteroscedasticity in the data set. Subsequently, a robust bootstrapping
approach is applied across cases then residuals, offering a
comprehensive assessment given our inability to make assumptions about
the underlying distribution. Through these techniques, we seek to
provide a more accurate understanding of the factors influencing house
prices across California’s diverse census block groups.
Section 1 Introduction
1.1 Motivation
The motivation behind undertaking this comprehensive data analysis of
California house prices lies in the quest to unravel the intricate
dynamics influencing one of the most pivotal aspects of the state’s
economy – its real estate market. As housing affordability and market
trends continue to be of paramount importance for residents,
policymakers, and real estate professionals alike, there exists a
compelling need to delve into the underlying factors that shape housing
prices across diverse census block groups.
This analysis seeks to provide actionable insights for informed
decision-making in various spheres. By applying advanced statistical
techniques such as Box-Cox transformations and bootstrapping, we aim to
not only mitigate data challenges but also enhance the robustness and
accuracy of our findings. The significance of this study is amplified by
its foundation in the 1990 California census data, offering a historical
perspective that enriches our understanding of how socio-economic and
spatial factors have shaped the housing landscape over time.
1.2 Data set Description
The data set was found on Kaggle. It includes information gathered
from various block groups in California during the 1990 Census. The U.S.
Census Bureau uses the block group as the smallest geographical unit,
typically consisting of 600 to 3000 people. Each observation in the data
set represents one block group and comprises an average of 1425.5
individuals in a geographically compact area.
This data set is a modified version of the data originally collected
by the US Census Bureau: Certain values were removed from the original
data set; In addition, a categorical dependent variable, Ocean
Proximity, was created. As a result, the data set contains 20640
observations of 9 dependent variables and one independent variable:
median house value.
The link to the raw CSV file in posted on GitHub: https://raw.githubusercontent.com/JZhong01/STA321/main/Topic%203/ca-housing-price.csv.
- Longitude(X1): How far west block group is located from prime
meridian
- Latitude(X2): How far north block group is located from equator
- HousingMedianAge(X3): Median age of house within block group
- TotalRooms(X4): Total number of rooms in block
- TotalBedrooms(X5): Total number of bedrooms in block
- Population(X6): Total number of residents in block group
- Households(X7): Total number of households, defined as a group of
people residing within a home unit
- MedianIncome(X8): Median income, in thousands of US Dollars, for
households within a block group
- MedianHouseValue(Y): Median house value, in US Dollars, in a
block
- OceanProximity(X9): Location of the house to an ocean or sea,
displayed as a categorical variable
1.3 Research Question
The primary objective of this analysis is to identify the primary
factors that influence house prices amongst California census block
groups.
Practical questions that this research aims to explore are as
follows:
- What socioeconomic factors exhibit the largest correlation with
median house price?
- What transformations may need to be conducted on the data, or do no
transformations need to be done?
This analysis will seek to explore 2 hypotheses:
2.1 Data Characterisitics
These data were originally published in the 1997 paper titled
Sparse Spatial Autoregressions by Pace et al. Each observation
represents one census block group from the 1990 US Census conducted in
California.
There is only one response variable - the median house value of the
block group. There are 9 predictor variables consisting of 8 numerical
variables and one nominal variable. These numerical variables describe
geographic location (longitude, latitude), house characteristics (median
house age, total rooms, total bedrooms), resident characteristics
(population, households), and socioeconomic characteristics (median
household income). The categorical variable very roughly describes 5
categories of relation to the ocean (<1 hour away, inland, near
ocean, near bay, or island).
Summary of Data Set’s Statistics
| longitude |
Min. :-124.3 |
1st Qu.:-121.8 |
Median :-118.5 |
Mean :-119.6 |
3rd Qu.:-118.0 |
Max. :-114.3 |
NA |
| latitude |
Min. :32.54 |
1st Qu.:33.93 |
Median :34.26 |
Mean :35.63 |
3rd Qu.:37.71 |
Max. :41.95 |
NA |
| housing_median_age |
Min. : 1.00 |
1st Qu.:18.00 |
Median :29.00 |
Mean :28.64 |
3rd Qu.:37.00 |
Max. :52.00 |
NA |
| total_rooms |
Min. : 2 |
1st Qu.: 1448 |
Median : 2127 |
Mean : 2636 |
3rd Qu.: 3148 |
Max. :39320 |
NA |
| total_bedrooms |
Min. : 1.0 |
1st Qu.: 296.0 |
Median : 435.0 |
Mean : 537.9 |
3rd Qu.: 647.0 |
Max. :6445.0 |
NA’s :207 |
| population |
Min. : 3 |
1st Qu.: 787 |
Median : 1166 |
Mean : 1425 |
3rd Qu.: 1725 |
Max. :35682 |
NA |
| households |
Min. : 1.0 |
1st Qu.: 280.0 |
Median : 409.0 |
Mean : 499.5 |
3rd Qu.: 605.0 |
Max. :6082.0 |
NA |
| median_income |
Min. : 0.4999 |
1st Qu.: 2.5634 |
Median : 3.5348 |
Mean : 3.8707 |
3rd Qu.: 4.7432 |
Max. :15.0001 |
NA |
| median_house_value |
Min. : 14999 |
1st Qu.:119600 |
Median :179700 |
Mean :206856 |
3rd Qu.:264725 |
Max. :500001 |
NA |
| ocean_proximity |
Length:20640 |
Class :character |
Mode :character |
NA |
NA |
NA |
NA |
A summary of the statistics’ distributions are made in the table
above. Our 9 numerical variables have descriptive statistics describing
minimum value, 1st quartile/25th percentile, median/50th percentile,
mean, 3rd quartile/75th percentile, and maximum values. For our
categorical variable it shows count. For all variables the last column
counts the total number of NAs or none values. Our data set no missing
values except 207 missing values for total bedrooms. This issue will be
resolved by imputing the missing values, which is the process of
replacing missing values with a statistical estimate based off other
observations. In this case, we are imputing the value with 435, the
median number of total bedrooms per block group.
After imputing the data with the median, the new total number of NAs
in the Total Bedrooms variable is 0.
# proof of imputation
sum(is.na(cahouses$total_bedrooms))
FALSE [1] 0
2.2 Exploratory Data Analysis
Exploratory data analysis is conducted to assess the distributions of
key variables.
2.2.1 Geographic Location
There are 3 variables that describe geographic locations of houses:
longitude, latitude, and ocean_proximity. A plot was first created that
plots house locations by longitude and latitude.

The longitude and latitude values plotted depict precise values that
describe the data in terms of geographical location.
Next, 5 histograms were plotted based off what value each observation
had for OceanProximity.

As depicted above, despite there being 5 unique values for
OceanProximity, <1H from ocean, Near ocean, and Near bay all have
distributions that closely resemble one another. This is a concern that
may be addressed later if multicollinearity is found between longitude,
latitude, and ocean proximity.
2.2.3 House Characteristics
The distributions for Housing Median Age, Total Rooms, and Total
Bedrooms is plotted below.

The distribution for the median house age looks relatively normal,
except the last bar for the histogram, which is unusually large. Upon
closer inspection, this is a result of the data being discretized
already. The last 500+ median age observations are 52 years old; this is
likely a result of all houses aged 50+ being recorded as 52 years old.
No transformation is likely required as normality is roughly
satisfied.

The plots for total rooms and total bedrooms both exhibit heavy right
skews. This is unsurprising as apartment complexes tend to have a
smaller number of rooms; since the census block group factors in
population into its calculation of what geographic area encompasses a
block group, dense populations will create a higher number of low room
count block groups. As a result, a transformation is likely to be
required to correct these skews.
2.2.4 Inhabitant Characteristics
The distributions for population and households are plotted
below.

Similar to the logic behind total rooms and total bedrooms,
population dense areas contribute to smaller geographic areas such as
urban and suburban areas; this leads to a larger frequency of block
groups with a smaller population and household count. As a result, a
transformation is likely to be required to correct these skews.
2.2.5 VIFs and Pairwise Scatterplots
VIFs or Variance Inflation Factors quantify the inflation of
coefficients due to collinearity. Higher multicollinearity indicate
stronger multicollinearity, which can lead to less reliable models. A
VIF of 1 indicates zero collinearity, a VIF of 4 indicates potential
multicollinearity, and a VIF of 10 indicates a strong presence of
multicollinearity.
FALSE GVIF Df GVIF^(1/(2*Df))
FALSE longitude 18.028444 1 4.245992
FALSE latitude 19.925764 1 4.463828
FALSE housing_median_age 1.321927 1 1.149751
FALSE total_rooms 12.349114 1 3.514131
FALSE total_bedrooms 27.040073 1 5.200007
FALSE population 6.342122 1 2.518357
FALSE households 28.315383 1 5.321220
FALSE median_income 1.740468 1 1.319268
FALSE ocean_proximity 4.090317 4 1.192531
The GVIF^(1/(2*Df)) value helps measure multicollinearity. It
quantifies how much variance increases if predictor variables are
correlated. The adjusted GVIF variables are adjusted for degrees of
freedom. An adjusted GVIF value of below 5 indicates little to no
multicollinearity. The variables Total Bedrooms and Households are
slightly above an adjusted GVIF value of 5, so that may be of concern
going forward.

The pair-wise is conducted on all variables except the response
variable Median House Value and the categorical variable Ocean
Proximity. The response variable should not be factored into the
pair-wise comparisons as its correlation is irrelevant to reducing the
model. The categorical variable is omitted because it pair-wise scatter
plots are generally used for only numerical variables and displaying the
relationship between them.
The pair-wise scatter plot of these 8 numerical variables shows 2
groups of variables that are heavily correlated with one another:
Longitude with Latitude; and Total Rooms, Total Bedrooms, Population,
and Households all with each other. Within each group, the correlation
was found to be r > 0.80 which represents a large correlation. This
demonstrates that there is some collinearity present between pairs of
predictors as well.
As a result of these VIFs and pair-wise scatter plots, it is
recommended that potential multicollinearity should be examined further
and see if removing any predictors is useful. These reduced models will
then be made candidate models.
3 Regression Modeling
After doing our exploratory data analysis and preparing the data,
next steps are to model the regression with an appropriate linear model
that not only satisfies the relationship, but also satisfies any
underlying assumptions about the data.
3.1 Model Refinement
From Section 2.2.5 we discovered potential multicollinearity in the
full model; and from Sections 2.2.2, 2.2.3, and 2.2.4, we see that
Median Income, Total Rooms, Total Bedrooms, Population, and Households
are all right-skewed. While it’s important to address both concerns,
addressing multicollinearity first helps so that we work with reduced
models where multicollinearity is not present.
3.2 Full Model Transformation
The full model is the linear model that includes all predictors.
Coefficient Matrix for Full Linear Model
|
|
Estimate
|
Std. Error
|
t value
|
Pr(>|t|)
|
|
(Intercept)
|
-2233091.969942
|
87478.9016089
|
-25.527206
|
0.0000000
|
|
longitude
|
-26430.444664
|
1013.8905512
|
-26.068341
|
0.0000000
|
|
latitude
|
-25173.279834
|
999.8252471
|
-25.177680
|
0.0000000
|
|
housing_median_age
|
1057.816131
|
43.7058419
|
24.203083
|
0.0000000
|
|
total_rooms
|
-4.730582
|
0.7706335
|
-6.138563
|
0.0000000
|
|
total_bedrooms
|
71.344944
|
5.9318763
|
12.027382
|
0.0000000
|
|
population
|
-39.287456
|
1.0639024
|
-36.927687
|
0.0000000
|
|
households
|
77.804415
|
6.6585750
|
11.684845
|
0.0000000
|
|
median_income
|
3.876045
|
0.0332222
|
116.670336
|
0.0000000
|
|
ocean_proximityINLAND
|
-39766.398744
|
1736.2502441
|
-22.903610
|
0.0000000
|
|
ocean_proximityISLAND
|
156065.719822
|
30772.4819840
|
5.071600
|
0.0000004
|
|
ocean_proximityNEAR BAY
|
-3697.401661
|
1906.0410182
|
-1.939833
|
0.0524136
|
|
ocean_proximityNEAR OCEAN
|
4758.753612
|
1562.7233999
|
3.045167
|
0.0023284
|
Residual plots are also conducted to check for assumptions.

The residual plots show minor violations in linearity, constant
variance, and normality. As mentioned in Section 2.2.5, the full model
has a heavy presence of multicollinearity, making it a poor model to
predict median house value.
3.2.1 Addressing Multicollinearity
In order to address multicollinearity, it is good practice to remove
one of the correlated variables. In this case, the highest adjusted
GVIFs are in Total Bedrooms and Households, so those will be iteratively
removed to assess for reduced model quality.
FALSE GVIF Df GVIF^(1/(2*Df))
FALSE longitude 17.877996 1 4.228238
FALSE latitude 19.838932 1 4.454092
FALSE housing_median_age 1.319408 1 1.148655
FALSE total_rooms 9.697980 1 3.114158
FALSE population 6.026165 1 2.454825
FALSE households 11.773368 1 3.431234
FALSE median_income 1.567345 1 1.251936
FALSE ocean_proximity 4.066419 4 1.191658
FALSE
FALSE Call:
FALSE lm(formula = median_house_value ~ . - total_bedrooms, data = cahouses)
FALSE
FALSE Residuals:
FALSE Min 1Q Median 3Q Max
FALSE -535915 -43323 -10823 29169 837443
FALSE
FALSE Coefficients:
FALSE Estimate Std. Error t value Pr(>|t|)
FALSE (Intercept) -2.122e+06 8.729e+04 -24.308 < 2e-16 ***
FALSE longitude -2.532e+04 1.013e+03 -24.988 < 2e-16 ***
FALSE latitude -2.438e+04 1.001e+03 -24.352 < 2e-16 ***
FALSE housing_median_age 1.035e+03 4.382e+01 23.618 < 2e-16 ***
FALSE total_rooms -4.360e-01 6.853e-01 -0.636 0.5246
FALSE population -4.214e+01 1.041e+00 -40.497 < 2e-16 ***
FALSE households 1.390e+02 4.309e+00 32.265 < 2e-16 ***
FALSE median_income 3.750e+00 3.164e-02 118.536 < 2e-16 ***
FALSE ocean_proximityINLAND -4.107e+04 1.739e+03 -23.616 < 2e-16 ***
FALSE ocean_proximityISLAND 1.628e+05 3.087e+04 5.274 1.35e-07 ***
FALSE ocean_proximityNEAR BAY -3.082e+03 1.912e+03 -1.612 0.1070
FALSE ocean_proximityNEAR OCEAN 5.043e+03 1.568e+03 3.216 0.0013 **
FALSE ---
FALSE Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
FALSE
FALSE Residual standard error: 68970 on 20628 degrees of freedom
FALSE Multiple R-squared: 0.643, Adjusted R-squared: 0.6428
FALSE F-statistic: 3377 on 11 and 20628 DF, p-value: < 2.2e-16
FALSE GVIF Df GVIF^(1/(2*Df))
FALSE longitude 17.735744 1 4.211383
FALSE latitude 19.706970 1 4.439253
FALSE housing_median_age 1.321463 1 1.149549
FALSE total_rooms 12.316689 1 3.509514
FALSE total_bedrooms 11.243102 1 3.353074
FALSE population 4.880018 1 2.209076
FALSE median_income 1.740394 1 1.319240
FALSE ocean_proximity 4.085804 4 1.192366
FALSE
FALSE Call:
FALSE lm(formula = median_house_value ~ . - households, data = cahouses)
FALSE
FALSE Residuals:
FALSE Min 1Q Median 3Q Max
FALSE -553232 -42896 -10536 29110 685840
FALSE
FALSE Coefficients:
FALSE Estimate Std. Error t value Pr(>|t|)
FALSE (Intercept) -2.369e+06 8.698e+04 -27.241 < 2e-16 ***
FALSE longitude -2.794e+04 1.009e+03 -27.693 < 2e-16 ***
FALSE latitude -2.640e+04 9.976e+02 -26.461 < 2e-16 ***
FALSE housing_median_age 1.067e+03 4.384e+01 24.346 < 2e-16 ***
FALSE total_rooms -4.269e+00 7.722e-01 -5.529 3.26e-08 ***
FALSE total_bedrooms 1.243e+02 3.838e+00 32.397 < 2e-16 ***
FALSE population -3.332e+01 9.363e-01 -35.585 < 2e-16 ***
FALSE median_income 3.879e+00 3.333e-02 116.367 < 2e-16 ***
FALSE ocean_proximityINLAND -4.013e+04 1.742e+03 -23.040 < 2e-16 ***
FALSE ocean_proximityISLAND 1.487e+05 3.087e+04 4.817 1.47e-06 ***
FALSE ocean_proximityNEAR BAY -3.750e+03 1.912e+03 -1.961 0.04988 *
FALSE ocean_proximityNEAR OCEAN 4.395e+03 1.568e+03 2.804 0.00506 **
FALSE ---
FALSE Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
FALSE
FALSE Residual standard error: 68960 on 20628 degrees of freedom
FALSE Multiple R-squared: 0.6431, Adjusted R-squared: 0.6429
FALSE F-statistic: 3379 on 11 and 20628 DF, p-value: < 2.2e-16
FALSE GVIF Df GVIF^(1/(2*Df))
FALSE longitude 17.734162 1 4.211195
FALSE latitude 19.697471 1 4.438183
FALSE housing_median_age 1.318726 1 1.148358
FALSE total_rooms 4.817136 1 2.194797
FALSE population 4.564656 1 2.136506
FALSE median_income 1.337037 1 1.156303
FALSE ocean_proximity 4.013501 4 1.189708
FALSE
FALSE Call:
FALSE lm(formula = median_house_value ~ . - total_bedrooms - households,
FALSE data = cahouses)
FALSE
FALSE Residuals:
FALSE Min 1Q Median 3Q Max
FALSE -500267 -45195 -11777 30104 504398
FALSE
FALSE Coefficients:
FALSE Estimate Std. Error t value Pr(>|t|)
FALSE (Intercept) -2.353e+06 8.916e+04 -26.390 < 2e-16 ***
FALSE longitude -2.825e+04 1.034e+03 -27.314 < 2e-16 ***
FALSE latitude -2.711e+04 1.022e+03 -26.514 < 2e-16 ***
FALSE housing_median_age 1.003e+03 4.490e+01 22.335 < 2e-16 ***
FALSE total_rooms 1.525e+01 4.950e-01 30.808 < 2e-16 ***
FALSE population -2.561e+01 9.283e-01 -27.586 < 2e-16 ***
FALSE median_income 3.359e+00 2.995e-02 112.156 < 2e-16 ***
FALSE ocean_proximityINLAND -4.677e+04 1.773e+03 -26.377 < 2e-16 ***
FALSE ocean_proximityISLAND 1.527e+05 3.164e+04 4.826 0.0000014 ***
FALSE ocean_proximityNEAR BAY -1.346e+03 1.959e+03 -0.687 0.49200
FALSE ocean_proximityNEAR OCEAN 4.385e+03 1.607e+03 2.729 0.00636 **
FALSE ---
FALSE Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
FALSE
FALSE Residual standard error: 70690 on 20629 degrees of freedom
FALSE Multiple R-squared: 0.6249, Adjusted R-squared: 0.6248
FALSE F-statistic: 3437 on 10 and 20629 DF, p-value: < 2.2e-16
While all 3 reduced models have no multicollinearity, we are also
concerned with whether the reduced model has significant coefficients.
By removing Total Bedrooms, it makes either Total Rooms or dummy
variable Ocean Proximity - near bay not statistically significant.
Therefore, the reduced model we’re going with is the reduced model where
only the Households predictor variable is removed.
3.2.2 Stepwise Regression
Step-wise selection is run on the full model to determine which
predictors that could be removed to improve the data.
#stepwise regression
step_model <- stepAIC(full_model, direction = "both", trace = FALSE)
summary(step_model)
FALSE
FALSE Call:
FALSE lm(formula = median_house_value ~ longitude + latitude + housing_median_age +
FALSE total_rooms + total_bedrooms + population + households +
FALSE median_income + ocean_proximity, data = cahouses)
FALSE
FALSE Residuals:
FALSE Min 1Q Median 3Q Max
FALSE -550613 -42739 -10602 28750 794919
FALSE
FALSE Coefficients:
FALSE Estimate Std. Error t value Pr(>|t|)
FALSE (Intercept) -2.233e+06 8.748e+04 -25.527 < 2e-16 ***
FALSE longitude -2.643e+04 1.014e+03 -26.068 < 2e-16 ***
FALSE latitude -2.517e+04 9.998e+02 -25.178 < 2e-16 ***
FALSE housing_median_age 1.058e+03 4.371e+01 24.203 < 2e-16 ***
FALSE total_rooms -4.731e+00 7.706e-01 -6.139 8.48e-10 ***
FALSE total_bedrooms 7.134e+01 5.932e+00 12.027 < 2e-16 ***
FALSE population -3.929e+01 1.064e+00 -36.928 < 2e-16 ***
FALSE households 7.780e+01 6.659e+00 11.685 < 2e-16 ***
FALSE median_income 3.876e+00 3.322e-02 116.670 < 2e-16 ***
FALSE ocean_proximityINLAND -3.977e+04 1.736e+03 -22.904 < 2e-16 ***
FALSE ocean_proximityISLAND 1.561e+05 3.077e+04 5.072 3.98e-07 ***
FALSE ocean_proximityNEAR BAY -3.697e+03 1.906e+03 -1.940 0.05241 .
FALSE ocean_proximityNEAR OCEAN 4.759e+03 1.563e+03 3.045 0.00233 **
FALSE ---
FALSE Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
FALSE
FALSE Residual standard error: 68730 on 20627 degrees of freedom
FALSE Multiple R-squared: 0.6455, Adjusted R-squared: 0.6452
FALSE F-statistic: 3129 on 12 and 20627 DF, p-value: < 2.2e-16
The stepwise model is identical to the full model. This means that
after running a stepwise regression, the procedure did not find any
predictors to remove or add to improve the model based on AICs. A
decision was made to keep the reduced model without Households as a
candidate model.
While the reduced model has improved model interpretability, this
does have the unintended consequences that the predictions are less
accurate. This is a trade-off that is taken as stability in the
regression coefficients is preferred. However, if all else fails, a
slight amount of multicollinearity is preferred to a far worse
model.
3.3 Candidate Models
The following models are candidate models:
- Full Model: No transformations
- Reduced Model: Full model without Households variable
- Box-Cox Full Model: Full model but log transformation of the
response variable i.e. log(Median House Value)
- Box-Cox Reduced Model: Reduced model but log transformation of the
response variable i.e. log(Median House Value)
3.3.3 Diagnostic Analyses of Box-Cox Models
Comparing the residual constant variance and normality with
side-by-side plots we get the following:


3.3.4 Goodness-of-fit Measures
Goodness-of-fit measures help to assess how well the model fits the
observed data. These measures offer a quantitative value for how closely
predicted values model the actual observed values.
The Goodness of Fit measures used here to assess the remaining
candidate models are the following:
- SSE: The sum of squared errors. Computed by adding up (predicted -
observed)^2 values.
- \(R^2\): Coefficient of
Determination or R-squared. Measure of the proportion of variation in
the dependent variable as explained by the independent variables.
- \(R^2_{adj}\): Adjusted R-squared,
which is a modified R-squared statistic that accounts for number of
predictors. It helps reduce the upward bias experienced by normal
R-squared.
- \(C_p\): Mallow’s \(C_p\), which is designed to factor in
goodness of fit and model complexity by factoring in estimating variance
in residuals (alongside SSE).
- AIC: Akaike Information Criterion, a calculation that balances model
fit with complexity.
- SBC: Schwarz Bayesian information Criterion, which is similar to AIC
in analyzing model fit and complexity, but calculating complexity by
factoring in sample size.
- PRESS: Predicted Residual Sum of Squares, which measures predictive
performance based on the residual and leverage values of each
observation in the model.
It’s worth noting that SSE, AIC, SBC, and PRESS depend on the
magnitude of the response. \(R^2\),
\(R^2_{adj}\), and \(C_p\) scale accordingly and are independent
of response. This means that when the response is transformed in log
transformations or box-cox, \(R^2\),
\(R^2_{adj}\), and \(C_p\) are better at measuring goodness of
fit.
Goodness-of-fit Measures of Candidate Models
|
|
SSE
|
R.sq
|
R.adj
|
Cp
|
AIC
|
SBC
|
PRESS
|
|
Full Model
|
9.744085e+13
|
0.65
|
0.65
|
13
|
459787.79
|
459890.94
|
9.782751e+13
|
|
Full Box-Cox Model
|
2.239840e+03
|
0.66
|
0.66
|
13
|
-45811.85
|
-45708.69
|
2.247550e+03
|
|
Reduced Box-Cox Model
|
2.253870e+03
|
0.66
|
0.66
|
12
|
-45684.99
|
-45589.77
|
2.259680e+03
|
|
Full Log Predictors Model
|
2.087270e+03
|
0.69
|
0.69
|
13
|
-47267.92
|
-47164.76
|
2.092090e+03
|
3.4 Final Model from Random Sampling
In selecting the optimal model for our analysis, we have decided to
utilize the full model with a Box-Cox transformation, as opposed to the
reduced model. This decision is informed by several considerations that
strike a balance between statistical rigor and practical application,
particularly in the context of a health study.
The full model, inclusive of “clinically important” variables as well
as other predictors, has been chosen despite the potential for
multicollinearity. This is because the adjusted Generalized Variance
Inflation Factor (GVIF) values are only marginally above the threshold
of 5, suggesting that multicollinearity may not substantially impair the
interpretability or predictive power of our model. Additionally, the
stepwise regression procedure, which includes both backward and forward
selection based on the Akaike Information Criterion (stepAIC), has
indicated that the full model is the most appropriate. The use of
stepAIC ensures that each variable’s inclusion is justified based on its
contribution to model fit while penalizing unnecessary complexity.
The Box-Cox transformation was applied and subsequently chosen due to
its significant improvement of the model fit over the non-transformed
version. This transformation addresses issues of non-normality in the
residuals, which can bias parameter estimates and standard errors,
leading to invalid inferences. By stabilizing variance and making the
data more closely meet the assumptions of linear regression, the Box-Cox
transformed model provides a more reliable foundation for drawing
conclusions about the relationships between variables.
We have opted not to go with log transforming any of the predictor
variables. While transforming the predictors increased the adjusted
R-squared value by 0.03, the model was complicated and confusing to
interpret. In our case, we prioritize the practical significance and
interpretability of the model coefficients in their original units.
Moreover, the marginal improvement in the residuals from the log
transformation does not outweigh the benefits of maintaining a model
that is straightforward to interpret for practitioners and
stakeholders.
Thus, the full model with Box-Cox transformation is chosen for its
balance of statistical and practical significance.
4 Bootstrapping
Bootstrapping is used to estimate the distribution of a sample
statistic without making assumptions about the underlying population
distribution. By repeatedly resampling with replacement from the data,
it generates a new distribution which can be used to assess the
variability and stability of the model estimates. This works because it
operates on the assumption that this sample is a good representation of
the population, so resampling generates new samples of the actual
population. This technique helps us alleviate concerns about the
influence of outliers, the normality of residuals, and the precision of
confidence/prediction intervals.
4.1 Bootstrap Cases
Bootstrap cases is the process of resampling an entire observation
from the original sample data. This preserves any correlation between
variables and is a more holistic approach to preserving each
observation.
To estimate the confidence intervals, we performed bootstrapping with
B = 1000 replicates. The coefficients from these replicates are stored
in a matrix ‘coef.mtrx’.
Histograms are then made of the distribution of bootstrap regression
coefficients where one observation is one run of the bootstrap
regression.

We then visualize the distribution of the bootstrap estimate for each
coefficient using histograms and overaly two density curves: - The red
density curve is the estimated regression coefficients and its p-values
are reported in the output. - The blue density curve is the estimate
based off the bootstrap sampling distribution.
The histograms show that the red and blue density curves are closely
aligned, indicating consistency between the significance tests
(p-values) and the bootstrap confidence intervals.
95% Confidence intervals are conducted on these bootstrap regression
coefficients and are combined with the final model’s coefficient
matrix.
Regression Coefficient Matrix
|
|
Estimate
|
Std. Error
|
t value
|
Pr(>|t|)
|
btc.ci.95
|
|
(Intercept)
|
-2.1900
|
0.4194
|
-5.2217
|
0.0000
|
[ -3.1524 , -1.2794 ]
|
|
longitude
|
-0.1604
|
0.0049
|
-32.9988
|
0.0000
|
[ -0.1719 , -0.1495 ]
|
|
latitude
|
-0.1562
|
0.0048
|
-32.5882
|
0.0000
|
[ -0.1678 , -0.1451 ]
|
|
housing_median_age
|
0.0025
|
0.0002
|
11.7091
|
0.0000
|
[ 0.002 , 0.0029 ]
|
|
total_rooms
|
-0.0000
|
0.0000
|
-2.2477
|
0.0246
|
[ 0 , 0 ]
|
|
total_bedrooms
|
0.0003
|
0.0000
|
9.5032
|
0.0000
|
[ 0.0002 , 0.0004 ]
|
|
population
|
-0.0002
|
0.0000
|
-34.8716
|
0.0000
|
[ -0.0002 , -0.0001 ]
|
|
households
|
0.0004
|
0.0000
|
11.3658
|
0.0000
|
[ 0.0002 , 0.0005 ]
|
|
median_income
|
0.0000
|
0.0000
|
104.8550
|
0.0000
|
[ 0 , 0 ]
|
|
ocean_proximityINLAND
|
-0.3099
|
0.0083
|
-37.2314
|
0.0000
|
[ -0.3299 , -0.2921 ]
|
|
ocean_proximityISLAND
|
0.6016
|
0.1475
|
4.0773
|
0.0000
|
[ 0.3006 , 0.8211 ]
|
|
ocean_proximityNEAR BAY
|
-0.0374
|
0.0091
|
-4.0960
|
0.0000
|
[ -0.0556 , -0.0195 ]
|
|
ocean_proximityNEAR OCEAN
|
-0.0319
|
0.0075
|
-4.2544
|
0.0000
|
[ -0.047 , -0.0158 ]
|
The summary table reveals that the significance tests based on
p-values align with the bootstrap confidence intervals, confirming our
robust results.
4.2 Bootstrap Residuals
Bootstrap residuals only resamples the residuals of a fitted model.
This involves creating a fitted model to the data, calculating the
residuals, then resampling with replacement from these residuals.
However, this method operates on the assumptions that the model
structure is accurate and that non-error components are nonrandom.
In this case, we are bootstrapping the residuals from the model that
has all predictors and log(Median House Value).
A histogram is made of the residual bootstrap estimates for each
regression coefficient.

In the coefficient histograms, we observe that the normal and LOESS
density curves closely align, which indicates that the p-value-based
inference and residual bootstrap inference yielded consistent
results.
Regression Coefficient Matrix with 95% Residual Bootstrap CI
|
|
Estimate
|
Std. Error
|
t value
|
Pr(>|t|)
|
btr.ci.95
|
|
(Intercept)
|
-2.1900
|
0.4194
|
-5.2217
|
0.0000
|
[ -2.991 , -1.421 ]
|
|
longitude
|
-0.1604
|
0.0049
|
-32.9988
|
0.0000
|
[ -0.17 , -0.1515 ]
|
|
latitude
|
-0.1562
|
0.0048
|
-32.5882
|
0.0000
|
[ -0.1661 , -0.1469 ]
|
|
housing_median_age
|
0.0025
|
0.0002
|
11.7091
|
0.0000
|
[ 0.002 , 0.0029 ]
|
|
total_rooms
|
-0.0000
|
0.0000
|
-2.2477
|
0.0246
|
[ 0 , 0 ]
|
|
total_bedrooms
|
0.0003
|
0.0000
|
9.5032
|
0.0000
|
[ 0.0002 , 0.0003 ]
|
|
population
|
-0.0002
|
0.0000
|
-34.8716
|
0.0000
|
[ -0.0002 , -0.0002 ]
|
|
households
|
0.0004
|
0.0000
|
11.3658
|
0.0000
|
[ 0.0003 , 0.0004 ]
|
|
median_income
|
0.0000
|
0.0000
|
104.8550
|
0.0000
|
[ 0 , 0 ]
|
|
ocean_proximityINLAND
|
-0.3099
|
0.0083
|
-37.2314
|
0.0000
|
[ -0.3266 , -0.293 ]
|
|
ocean_proximityISLAND
|
0.6016
|
0.1475
|
4.0773
|
0.0000
|
[ 0.3251 , 0.885 ]
|
|
ocean_proximityNEAR BAY
|
-0.0374
|
0.0091
|
-4.0960
|
0.0000
|
[ -0.0559 , -0.0195 ]
|
|
ocean_proximityNEAR OCEAN
|
-0.0319
|
0.0075
|
-4.2544
|
0.0000
|
[ -0.0464 , -0.0182 ]
|
The resulting intervals align with the significance tests based on
p-values, suggesting the estimated coefficients approximate normal
distributions well, likely due to a sufficiently large sample size.
5 Results
5.1 Combining Inferential Statistics
We synthesize inferential statistics into a single table for
comparison of both bootstrap methods:
Final Combined Inferential Statistics: p-values and Bootstrap CIs
|
|
Estimate
|
Std. Error
|
Pr(>|t|)
|
btc.ci.95
|
btr.ci.95
|
|
(Intercept)
|
-2.1900
|
0.4194
|
0.0000
|
[ -3.1524 , -1.2794 ]
|
[ -2.991 , -1.421 ]
|
|
longitude
|
-0.1604
|
0.0049
|
0.0000
|
[ -0.1719 , -0.1495 ]
|
[ -0.17 , -0.1515 ]
|
|
latitude
|
-0.1562
|
0.0048
|
0.0000
|
[ -0.1678 , -0.1451 ]
|
[ -0.1661 , -0.1469 ]
|
|
housing_median_age
|
0.0025
|
0.0002
|
0.0000
|
[ 0.002 , 0.0029 ]
|
[ 0.002 , 0.0029 ]
|
|
total_rooms
|
-0.0000
|
0.0000
|
0.0246
|
[ 0 , 0 ]
|
[ 0 , 0 ]
|
|
total_bedrooms
|
0.0003
|
0.0000
|
0.0000
|
[ 0.0002 , 0.0004 ]
|
[ 0.0002 , 0.0003 ]
|
|
population
|
-0.0002
|
0.0000
|
0.0000
|
[ -0.0002 , -0.0001 ]
|
[ -0.0002 , -0.0002 ]
|
|
households
|
0.0004
|
0.0000
|
0.0000
|
[ 0.0002 , 0.0005 ]
|
[ 0.0003 , 0.0004 ]
|
|
median_income
|
0.0000
|
0.0000
|
0.0000
|
[ 0 , 0 ]
|
[ 0 , 0 ]
|
|
ocean_proximityINLAND
|
-0.3099
|
0.0083
|
0.0000
|
[ -0.3299 , -0.2921 ]
|
[ -0.3266 , -0.293 ]
|
|
ocean_proximityISLAND
|
0.6016
|
0.1475
|
0.0000
|
[ 0.3006 , 0.8211 ]
|
[ 0.3251 , 0.885 ]
|
|
ocean_proximityNEAR BAY
|
-0.0374
|
0.0091
|
0.0000
|
[ -0.0556 , -0.0195 ]
|
[ -0.0559 , -0.0195 ]
|
|
ocean_proximityNEAR OCEAN
|
-0.0319
|
0.0075
|
0.0000
|
[ -0.047 , -0.0158 ]
|
[ -0.0464 , -0.0182 ]
|
This table demonstrates that all 3 methods came up with significant
individual predictor variables. As a result of the findings for
bootstrap residuals, bootstrap cases, and traditional inferential
methods yielding significant predictors, it validates the model’s
estimates. This demonstrates that the model is stable, implies that the
predictive relationships are truly present in the data, and concludes
that the findings are robust.
5.2 Interpretation in Context
Inferential Statistics of Final Model
|
|
Estimate
|
Std. Error
|
t value
|
Pr(>|t|)
|
|
(Intercept)
|
-2.1900420
|
0.4194125
|
-5.221689
|
0.0000002
|
|
longitude
|
-0.1604085
|
0.0048610
|
-32.998810
|
0.0000000
|
|
latitude
|
-0.1562148
|
0.0047936
|
-32.588174
|
0.0000000
|
|
housing_median_age
|
0.0024536
|
0.0002095
|
11.709063
|
0.0000000
|
|
total_rooms
|
-0.0000083
|
0.0000037
|
-2.247698
|
0.0246060
|
|
total_bedrooms
|
0.0002703
|
0.0000284
|
9.503196
|
0.0000000
|
|
population
|
-0.0001779
|
0.0000051
|
-34.871571
|
0.0000000
|
|
households
|
0.0003628
|
0.0000319
|
11.365763
|
0.0000000
|
|
median_income
|
0.0000167
|
0.0000002
|
104.855008
|
0.0000000
|
|
ocean_proximityINLAND
|
-0.3099273
|
0.0083244
|
-37.231402
|
0.0000000
|
|
ocean_proximityISLAND
|
0.6015525
|
0.1475369
|
4.077303
|
0.0000457
|
|
ocean_proximityNEAR BAY
|
-0.0374307
|
0.0091384
|
-4.095974
|
0.0000422
|
|
ocean_proximityNEAR OCEAN
|
-0.0318756
|
0.0074924
|
-4.254393
|
0.0000211
|
The final model has 8 numerical predictors and 1 categorical variable
that is split into 4 dummy variables. As a result, the model conforms to
the following format:
\(Log(MedianHouseValue) = \beta_0 + \beta_1
* Longitude + \beta_2 * Latitude + \beta_3 * HousingMedianAge + \beta_4
* TotalRooms + \beta_5 * TotalBedrooms + \beta_6 * Population + \beta_7
* Households + \beta_8 * MedianIncome + \beta_9 * Ocean
Proximity_{Inland} + \beta_10 * Ocean Proximity_{Island} + \beta_11 *
Ocean Proximity_{Near Bay} + \beta_12 * Ocean Proximity_{Near Ocean} +
\epsilon\)
where each beta represents a coefficient in the linear model and
epsilon represents the residuals or error value.
The final model is as follows:
Log(Median House Value) = -2.19 - 0.16 * Longitude - 0.156 * Latitude
+ 0.00245 * Housing Median Age - \(8.3 *
10^{-6}\) * Total Rooms + \(2.7 *
10^{-4}\) * Total Bedrooms - \(1.78 *
10^{-4}\) * Population + \(3.63 *
10^{-4}\) * Households + \(1.67 *
10^{-5}\) * Median Income - 0.310 * \(Ocean Proximity_{Inland}\) + 0.602 * \(Ocean Proximity_{Island}\) - 0.0374 * \(Ocean Proximity_{Near Bay}\) - 0.0319 *
\(Ocean Proximity_{Near Ocean}\).
In context, this means that if every other predictor variable was
held constant, a one degree change in Longitude would result in
Log(Median House Value) decreasing by 0.16. This logic applies to every
other regression coefficient.
One thing to note, however, is that the response variable is
Log(Median House Value). So if you wanted to interpret this value in
terms of US Dollars, you’d take Euler’s number to the power of the
response to convert it from Log(US Dollars) to US Dollars.
6 Discussion
6.1 Main Findings
The main findings are that the original full model was rife with
violations in the assumptions. In order to alleviate these concerns,
many transformations were attempted i.e. reducing the model, box-cox
transformations, log transformations of predictor variables, etc. In the
end a decision was made to balance statistical significance with
practical significance, which resulted in going with a simplified
version of the Box-Cox transformation where the response variable was
the only variable transformed.
As a result, both bootstrap methods and the traditional inferential
statistics found that all predictor variables were statistically
significant at significance level \(\alpha\) = 0.05. Using bootstrap confidence
intervals, we further found that these findings are robust in spite of
minor violations of the assumptions. This means that we can comfortably
recommend using this model to predict median house value in California
homes within the context of US Census Bureau block groups.
6.2 Drawbacks and Future Improvements
This decision to forego statistical significance to enhance model
interpretability comes with drawbacks.
Less Predictive power: Log-transforming the predictor
variables resulted in a higher R^2 value, which indicates better
predictive power. By purposefully choosing a model with lower R^2, we
have elected to have a model that isn’t as good at predicting.
Overlooking interactions: By choosing to simplify the
model, we forego any analysis of potential interactions between
variables.
In addition, collinearity and multicollinearity in the model was not
adequately addressed. Many of the variables such as Total Rooms and
Total Bedrooms could linearly predict each other relatively accurate.
This poses a problem by inflating standard errors and could lead to
unreliable estimates for regression models. Some of the regression
coefficients in the model were relatively low in magnitude, indicating
that certain variables had less of an impact on the final predicting of
the response; these variables could be reconsidered on whether or not
they can be foregone.
6.3 Applications of Final Model
This model is useful in predicting housing prices in California based
of 1990 US Census data. This model can be used comparatively with future
census results in California to compare in a time series on the impacts
of certain predictors to see if certain predictors become more or less
important towards predicting housing prices.
It’s recommended that this model be used with caution as there are
still minor violations in assumptions. While, these have been addressed
by bootstrap methods and transformations, this could cause the model to
be inaccurate at the higher and lower extremes. In addition, the
underlying data was discretized before model analysis was run on it;
this can also impact any results the model will provide.