1 Data set Description

The data set was found on Kaggle. It includes information gathered from various block groups in California during the 1990 Census. The U.S. Census Bureau uses the block group as the smallest geographical unit, typically consisting of 600 to 3000 people. Each observation in the data set represents one block group and comprises an average of 1425.5 individuals in a geographically compact area.

The data set contains 20640 observations of 9 dependent variables and one independent variable: median house value. The link to the raw CSV file in posted on GitHub: https://raw.githubusercontent.com/JZhong01/STA321/main/Topic%203/ca-housing-price.csv.

2 Research Question

The primary objective of this analysis is to identify the association between median house value of California houses and key predictor variables that describe geographical location, housing characteristics, demographic details, and economic indicators of block groups and their inhabitants.

3 Exploratory Data Analysis

The original data set contains 9 independent variables and 1 dependent variable. However, some modifications need to be made as not all the variables are useful in describing the association with median house values.

3.1 Geographic Location

There are 3 variables that describe geographic locations of houses: longitude, latitude, and ocean_proximity. A plot was first created that plots house locations by longitude and latitude.

## Set echo to false because the code in this section is ugly

cahouses = read.csv("https://raw.githubusercontent.com/JZhong01/STA321/main/Topic%203/ca-housing-price.csv", 
   header = TRUE)

lon <- cahouses$longitude
lat <- cahouses$latitude

par(asp = 1)
plot(lon, lat, main = "CA houses in 1990 Census", xlab = "Longitude", ylab = "Latitude")

The longitude and latitude values plotted depict precise values that describe the data in terms of geographical location.

Next, 5 histograms were plotted based off what value each observation had for OceanProximity.

As depicted above, despite there being 5 unique values for OceanProximity, <1H from ocean, Near ocean, and Near bay all have distributions that closely resemble one another.

As a result, a decision was made to exclude OceanProximity as a predictor in the first candidate model because it may not strongly indicate association of geographical location to house price. Another reason why OceanProximity was excluded was because possible collinearity may be present between ocean proximity and the longitude and latitude variables; since this may be of concern, longitude and latitude were kept in the model because they are more precise and thus better indicators for the model.

3.2 Median Income

There is a discrepancy in units for median income of households and median house value. The units for median income are in thousands of US Dollars whereas the units for median house value are in US Dollars. As such, unit conversion for median income was performed to convert units to US Dollars to be more consistent.

cahouses$median_income <- cahouses$median_income * 10000

4 Linear Model

The initial full model will consist of 8 predictor variables (all the original ones minus OceanProximity and with the updated MedianIncome). The linear model is represented as the following equation:

\(MedianHouseValue = \beta_0 + \beta_1 * Longitude + \beta_2 * Latitude + \beta_3 * HousingMedianAge + \beta_4 * TotalRooms + \beta_5 * TotalBedrooms + \beta_6 * Population + \beta_7 * Households + \beta_8 * MedianIncome + \epsilon\)

where each beta represents a coefficient in the linear model and epsilon represents the residuals or error value.

4.1 Initial Model

By running lm(), the linear model of the initial model can be run and output coefficients for the model.

initial_model = lm(median_house_value ~. - ocean_proximity, data = cahouses)


##I'm unsure how to get P-value to display something other than just 0, setting digits = 2 didn't work
kable(summary(initial_model)$coef, caption = "Statistics of Regression Coefficients", digits = 2, scientific = TRUE)
Statistics of Regression Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3585395.75 62900.54 -57.00 0
longitude -42730.12 717.09 -59.59 0
latitude -42509.74 676.95 -62.80 0
housing_median_age 1157.90 43.39 26.69 0
total_rooms -8.25 0.79 -10.39 0
total_bedrooms 113.82 6.93 16.42 0
population -38.39 1.08 -35.41 0
households 47.70 7.55 6.32 0
median_income 4.03 0.03 119.50 0
alteredcoef_model <- format(initial_model$coefficients, scientific = FALSE, digits = 2) # Round to 2 decimal points

Adding the coefficients to the model gives us the following:

MedianHouseValue = -3585395.7 + -42730.1 * Longitude + -42509.7 * Latitude + 1157.9 * HousingMedianAge + -8.2 * TotalRooms + 113.8 * TotalBedrooms + -38.4 * Population + 47.7 * Households + 4.0 * MedianIncome + \(\epsilon\).

4.2 Residual Analysis

4 diagnostic plots are run to see if assumptions about residuals are met. The Versus Fits plot checks for linearity of residuals, Q-Q plot checks for normal distribution of residuals, Scale-Location checks for constant variance of residuals, and the Residuals vs Leverage plot checks for influential points.

par(mfrow = c(2,2))
plot(initial_model)

Based off the residual plots, there are some minor violations:

  • Observation #15361 is an influential point and can be seen affecting the linearity of the model.
  • Variance of the residuals is not constant
  • The QQ plot indicates a slightly non-normal distribution. Mainly observation 15361 is affecting the plot’s distribution.

4.3 Variance Inflation Factors

VIF indices were constructed to detect issues with multicollinearity

vif(initial_model)
FALSE          longitude           latitude housing_median_age        total_rooms 
FALSE           8.713740           8.828919           1.260015          12.717000 
FALSE     total_bedrooms         population         households      median_income 
FALSE          36.003726           6.371238          35.136045           1.731511

The model displays significant issues with multicollinearity. No multicollinearity is present with VIF indices close to 1, and there is concern if the indices are 10 or larger. Considering that 3 indices are above 10 and 2 are close to 10, this model may need adjustments to balance out this issue.

5 Bootstrap Confidence Intervals

Since parametric assumptions cannot be made for the underlying distribution, bootstrap confidence intervals are used to make inferences.

After conducting 1000 replications of bootstrap samples, 95% confidence intervals were conducted on every beta coefficient in the linear model.

The 95% confidence intervals are as follows:

Since none of the 95% confidence intervals for each of these coefficients contains 0, that means that at the \(\alpha\) = 0.05 level, all coefficients in the linear model are statistically significant.

6 Box-Cox Transformation

A Box-Cox Transformation is performed to correct for heteroscedasticity and non-normal distribution of the residuals.

boxcox(median_house_value~.-ocean_proximity, data = cahouses, lambda = seq(0.15, 0.25, length = 10))

The lambda value from the box-cox suggests that a lambda of 0.22 may be effective in transforming the model.

bctransformed_model <- lm((median_house_value)^0.22 ~. -ocean_proximity, data = cahouses)

kable(summary(bctransformed_model)$coef, caption = "log-transformed model", digits = 2)
log-transformed model
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.52 0.95 -60.34 0
longitude -0.83 0.01 -76.15 0
latitude -0.84 0.01 -81.49 0
housing_median_age 0.01 0.00 18.39 0
total_rooms 0.00 0.00 -8.80 0
total_bedrooms 0.00 0.00 14.93 0
population 0.00 0.00 -34.06 0
households 0.00 0.00 6.86 0
median_income 0.00 0.00 113.24 0
par(mfrow = c(2,2))
plot(bctransformed_model)

It appears that the residual plots after performing the box-cox transformation are significantly better.

6.1 Removing Observation #15361

However, the influential point at observation 15361 significantly affects the data adversely. Observation 15361 has a Latitude of 33.35 and a Longitude of -117.42, which is where the Marine Corps’ Base Camp Pendleton. This explains why the predictor variables poorly predict for this location. As such, a second model will be conducted by removing this point.

#Remove observation 15361
noinfpt_data <- cahouses[-15361,]


noinfpt_model <- lm(median_house_value~.-ocean_proximity, data = noinfpt_data)

#par(mfrow = c(2,2))
#plot(noinfpt_model)

noinfpt_bctransformed_model <- lm((median_house_value)^0.22 ~. -ocean_proximity, data = noinfpt_data)

kable(summary(noinfpt_bctransformed_model)$coef, caption = "log-transformed model", digits = 2)
log-transformed model
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.49 0.95 -60.51 0
longitude -0.83 0.01 -76.48 0
latitude -0.84 0.01 -82.01 0
housing_median_age 0.01 0.00 18.07 0
total_rooms 0.00 0.00 -7.86 0
total_bedrooms 0.00 0.00 14.23 0
population 0.00 0.00 -36.05 0
households 0.00 0.00 8.45 0
median_income 0.00 0.00 112.79 0
par(mfrow = c(2,2))
plot(noinfpt_bctransformed_model)

6.2 Box-Cox after Removing #15361

After removing the influential point and then performing a box-cox transformation, the residual plots look substantially better.

noinfpt_bc_model <- lm((median_house_value)^0.22 ~. -ocean_proximity, data = noinfpt_data)

kable(summary(noinfpt_bc_model)$coef, caption = "log-transformed model w/o outlier", digits = 2)
log-transformed model w/o outlier
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.49 0.95 -60.51 0
longitude -0.83 0.01 -76.48 0
latitude -0.84 0.01 -82.01 0
housing_median_age 0.01 0.00 18.07 0
total_rooms 0.00 0.00 -7.86 0
total_bedrooms 0.00 0.00 14.23 0
population 0.00 0.00 -36.05 0
households 0.00 0.00 8.45 0
median_income 0.00 0.00 112.79 0
par(mfrow = c(2,2))
plot(noinfpt_bc_model)

7 Goodness of Fit Measures

Goodness-of-fit measures help to assess how well the model fits the observed data. These measures offer a quantitative value for how closely predicted values model the actual observed values.

The Goodness of Fit measures used here to assess the candidate models are the following:

It’s worth noting that SSE, AIC, SBC, and PRESS depend on the magnitude of the response. \(R^2\), \(R^2_{adj}\), and \(C_p\) scale accordingly and are independent of response. This means that when the response is transformed in log transformations or box-cox, \(R^2\), \(R^2_{adj}\), and \(C_p\) are better at measuring goodness of fit.

7.1 Candidate Models

The models that will be analyzed will be the full model with slight modifications (no ocean proximity, median_income * 10000), box-cox transformation of the model, and box-cox transformation with the outlier removed.

7.2 Goodness-of-fit on Candidate models

select=function(m){ # m is an object: model
 e = m$resid                           # residuals
 n0 = length(e)                        # sample size
 SSE=(m$df)*(summary(m)$sigma)^2       # sum of squared error
 R.sq=summary(m)$r.squared             # Coefficient of determination: R square!
 R.adj=summary(m)$adj.r                # Adjusted R square
 MSE=(summary(m)$sigma)^2              # square error
 Cp=(SSE/MSE)-(n0-2*(n0-m$df))         # Mellow's p
 AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df)          # Akaike information criterion
 SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df)  # Schwarz Bayesian Information criterion
 X=model.matrix(m)                     # design matrix of the model
 H=X%*%solve(t(X)%*%X)%*%t(X)          # hat matrix
 d=e/(1-diag(H))                       
 PRESS=t(d)%*%d   # predicted residual error sum of squares (PRESS)- a cross-validation measure
 tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
 names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
 tbl
 }


## Edited this because the original kable was difficult to read. 

select(initial_model)
FALSE              SSE      R.sq     R.adj Cp      AIC      SBC          PRESS
FALSE 1 98856034611384 0.6369117 0.6367695  9 455669.4 455740.7 99196440712343
select(bctransformed_model)
FALSE        SSE     R.sq     R.adj Cp      AIC      SBC    PRESS
FALSE 1 22707.56 0.651455 0.6513185  9 2174.635 2245.959 22780.04
select(noinfpt_bc_model)
FALSE        SSE      R.sq     R.adj Cp      AIC     SBC   PRESS
FALSE 1 22560.41 0.6537088 0.6535732  9 2042.696 2114.02 22595.6

These goodness of fit measures demonstrate that the the box-cox transformation with the outlier removed appears to be the best model. This is because it had the largest adjusted R-squared value, measuring 0.6536. This means that 65.36% of the variability in median house value could be explained by the box-cox transformed model.

8 Final Model

As such, the final model is the linear model without observation 15361 present and response variable ‘median house value’ taken to the 0.22 power (what the lambda value was). The inferential statistics is summarized in the following table.

#Used base R to print because p-values kept displaying 0 instead of scietnific notation

#kable(summary(bctransformed_model)$coef, caption = "Inferential Statistics of Final Model", digits = 6, scientific = TRUE)


summary_coefs <- summary(noinfpt_bc_model)$coef

print(summary_coefs, main = "Inferential Statsitics of Final Model")
FALSE                           Estimate   Std. Error    t value      Pr(>|t|)
FALSE (Intercept)        -57.49498102093 9.502503e-01 -60.505089  0.000000e+00
FALSE longitude           -0.82854033708 1.083346e-02 -76.479763  0.000000e+00
FALSE latitude            -0.83897988681 1.022994e-02 -82.012232  0.000000e+00
FALSE housing_median_age   0.01185125353 6.558045e-04  18.071321  1.966645e-72
FALSE total_rooms         -0.00009457417 1.203948e-05  -7.855336  4.183137e-15
FALSE total_bedrooms       0.00149272078 1.049053e-04  14.229218  9.984840e-46
FALSE population          -0.00062942535 1.745768e-05 -36.054355 5.060414e-276
FALSE households           0.00097279578 1.151604e-04   8.447309  3.177695e-17
FALSE median_income        0.00005754665 5.102028e-07 112.791734  0.000000e+00

8.1 Summary of the model

The final model can be written as the following:

\(Median House Value^{0.22}\) = -57.495 -0.829 * Longitude - 0.839 * Latitude + 0.0119 * Housing Median Age - 9.46E\(10^{-5}\) * Total Rooms + 1.49E\(10^{-3}\) * Total Bedrooms - 6.29E\(10^{-4}\) * Population + 9.73E\(10^{-4}\) * Households + 5.75E\(10^{-5}\) * Median Income + \(\epsilon\).

Each coefficient represents if one unit increase is made for that predictor variable with everything else held constant, that the median house value^0.22 will increase(+) or decrease(-) by that coefficient amount.

For example, what this means is that per one degree increase in longitude, housing price^0.22 will decrease by 0.829.

Or for an increase of 1 US Dollar in Median Income will increase median house value^0.22 by 5.75 * \(10^{-5}\).

8.2 Residual Analysis on Final Model

The residual plots demonstrate that there is still issues present in normality, linearity, and heteroscedasticity. No large influential points remain.

par(mfrow = c(2,2))
plot(noinfpt_bc_model)

9 Conlusions and Discussion

There was significant issues with residual assumptions as well that were partially rectified with modified transformation and removal of the outlier. Still, the violations for these residual assumptions are uncorrected. The inference is based on the central limit theorem given the large sample size of N = 20640.

One pressing concern is the troubling multicollinearity. That was not addressed in this report, and with the results seen in the final model, it largely reflects this as many of the coefficients were small. Going forward, it would have been wise to reduce this model. From a purely logical perspective, many pairs of variables are likely to be highly correlated: rooms and bedrooms; population and households could be easily be justified to be correlated in one way or another.

However, for the purposes of association, this report strongly demonstrates that all of these predictor variables affect the median house price in California. It doesn’t serve as a good basis for prediction, especially considering its logarithmic transformation of the response variable.

Bootstrap methodology was only used partially. If it were used throughout the entire analysis and the model reduced to remove multicollinearity, the resultant model would perform substantially better than the final model here.

References

Wang, H.(2018). housing.csv[Data set]. https://www.kaggle.com/code/harrywang/housing-price-prediction/input?select=housing.csv

O’Reilly Media (2017). California Housing. GitHub. https://github.com/ageron/handson-ml/blob/master/datasets/housing/housing.csv

GPS coordinates, latitude and longitude with interactive maps. GPS coordinates, latitude and longitude with interactive Maps. (2023). https://www.gps-coordinates.net/