Introduction

The “HomesForSale” dataset, sourced from Lock5Stat, offers valuable insights into the housing market across four states in the U.S.: California (CA), New York (NY), New Jersey (NJ), and Pennsylvania (PA). This dataset provides a platform to explore relationships between various home features and their influence on housing prices, as well as regional differences in market values.

Questions for Analysis

This analysis focuses on the following questions to understand key housing market dynamics:

  1. Influence of Home Size on Price in California: How does the size (square footage) of a home impact its price in California?

  2. Impact of Bedrooms on Price in California: Does the number of bedrooms in a California home significantly influence its price?

  3. Impact of Bathrooms on Price in California: How does the number of bathrooms affect home prices in California?

  4. Joint Influence of Size, Bedrooms, and Bathrooms on Price in California: To what extent do these three factors collectively determine home prices in California?

  5. Regional Differences in Home Prices: Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This question investigates whether a home’s state significantly influences its price.

This study uses regression and ANOVA to assess how individual and combined variables relate to home prices. Linear regression models are fitted for questions 1-4, with a focus on the slope estimates and their p-values to evaluate statistical significance. For question 5, a one-way ANOVA tests whether mean prices differ across states, followed by Tukey post-hoc comparisons to identify which pairs of states differ. Together these analyses give a data-driven picture of which factors are associated with home prices in this sample.

Data

The data for these questions come from the “HomesForSale” dataset, sourced from Lock5Stat. The dataset contains samples of homes for sale in each of the four states, selected from zillow.com in 2019. It is a data frame with 120 observations on the following 5 variables. Note that Price is an asking price, not a final sale price.

State: Location of the home (CA, NJ, NY, or PA)
Price: Asking price (in $1,000’s)
Size: Area of all rooms (in 1,000’s sq. ft.)
Beds: Number of bedrooms
Baths: Number of bathrooms

The data is shown in the data table below.

library(DT)

# Load the dataset from Lock5Stat and display it as an interactive, scrollable table
Homes <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
datatable(Homes, options = list(scrollX = TRUE, pageLength = 15))

Analysis

1. How does the size (square footage) of a home impact its price in California?

This question focuses on the relationship between a home’s size and its price in California. We restrict the data to California homes and fit a simple linear regression of price on size to examine whether square footage is significantly associated with asking price. The slope estimates how much the asking price changes, on average, with additional space in the California sample.

# Filter data for California
Homes_CA <- subset(Homes, State == "CA")

# Check for missing values and remove them
Homes_CA <- na.omit(Homes_CA)

# Linear regression: Price vs Size
model_size <- lm(Price ~ Size, data = Homes_CA)

# Summary of the model
summary(model_size)
## 
## Call:
## lm(formula = Price ~ Size, data = Homes_CA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462.55 -139.69   39.24  147.65  352.21 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -56.81675  154.68102  -0.367 0.716145    
## Size          0.33919    0.08558   3.963 0.000463 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared:  0.3594, Adjusted R-squared:  0.3365 
## F-statistic: 15.71 on 1 and 28 DF,  p-value: 0.0004634

Regression Results:

  • The slope estimate for size is 0.33919. Price is in $1,000s, and although the variable description lists Size in 1,000s of square feet, the magnitude of the slope implies that the file records Size in square feet. On that reading, each additional square foot is associated with about $339 in asking price, or approximately $339,190 per additional 1,000 square feet.

  • The p-value for size is 0.000463 (p < 0.001), which is strong evidence of a positive linear association between home size and price.

  • The \(R^2\) value is 0.3594, meaning that about 36% of the variability in home prices is explained by size alone.

The regression shows that home size is significantly associated with price, with larger homes listed at higher prices in California. However, the \(R^2\) value indicates that about 64% of the variability in price is left unexplained, so factors not included in the model may also play a substantial role.
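As a worked check on the units, the fitted line can be evaluated by hand from the printed coefficients. This is a minimal sketch that assumes Size is recorded in square feet (an inference from the slope's magnitude, not something the output states); the 2,000-square-foot home is a hypothetical example, not a row from the data.

```r
# Coefficients copied from summary(model_size) above
intercept <- -56.81675   # in $1,000s
slope     <- 0.33919     # in $1,000s per unit of Size

# Assumption: Size is measured in square feet
dollars_per_sqft      <- slope * 1000          # dollars per extra square foot
dollars_per_1000_sqft <- slope * 1000 * 1000   # dollars per extra 1,000 square feet

# Hypothetical home of 2,000 square feet (illustrative value only)
size_example    <- 2000
predicted_price <- (intercept + slope * size_example) * 1000  # in dollars

dollars_per_sqft        # about 339
dollars_per_1000_sqft   # about 339,190
predicted_price         # about 621,563
```

With the fitted model in memory, `predict(model_size, newdata = data.frame(Size = 2000))` should return the same prediction in $1,000s, up to rounding of the printed coefficients.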

2. Does the number of bedrooms in a California home significantly influence its price?

To assess the impact of bedrooms on home prices, we analyze the subset of California data using a linear regression model. This analysis examines whether the number of bedrooms significantly contributes to home prices and helps determine if this feature is an important consideration for buyers in the California housing market.

# Linear regression: Price vs Beds
model_beds <- lm(Price ~ Beds, data = Homes_CA)

# Summary of the model
summary(model_beds)
## 
## Call:
## lm(formula = Price ~ Beds, data = Homes_CA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -413.83 -236.62   29.94  197.69  570.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   269.76     233.62   1.155    0.258
## Beds           84.77      72.91   1.163    0.255
## 
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared:  0.04605,    Adjusted R-squared:  0.01198 
## F-statistic: 1.352 on 1 and 28 DF,  p-value: 0.2548

Regression Results:

  • The slope estimate for bedrooms is 84.77, indicating that each additional bedroom is associated with an average increase of $84,770 in price.

  • The p-value for bedrooms is 0.255, which is not significant (p > 0.05).

  • The \(R^2\) value is 0.04605, showing that only about 4.6% of the variability in prices is explained by the number of bedrooms.

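The regression provides no statistically significant evidence that the number of bedrooms is related to home prices in California. With only 30 homes in the sample, however, the slope is estimated imprecisely (its standard error is nearly as large as the estimate itself), so a real effect cannot be ruled out. The result may indicate that buyers place more value on other features, such as home size or location, than on the number of bedrooms.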
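One way to see why the bedroom slope is not significant is to build its 95% confidence interval from the printed estimate and standard error. The sketch below uses only numbers copied from the output above; the variable names are illustrative.

```r
# Values copied from summary(model_beds) above
estimate  <- 84.77   # slope for Beds, in $1,000s per bedroom
std_error <- 72.91
df_resid  <- 28      # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit  <- qt(0.975, df = df_resid)
ci_beds <- estimate + c(-1, 1) * t_crit * std_error

ci_beds   # roughly -64.6 to 234.1, an interval that contains zero
```

Because the interval spans zero, the data are compatible with bedrooms having no association with price as well as with a sizable positive one. With the model object available, `confint(model_beds)` should give the same interval up to rounding.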

3. How does the number of bathrooms affect home prices in California?

We investigate the relationship between the number of bathrooms and home prices in California using a simple linear regression model. This analysis explores whether additional bathrooms add significant value to a home and offers insights into their importance in the California real estate market.

# Linear regression: Price vs Baths
model_baths <- lm(Price ~ Baths, data = Homes_CA)

# Summary of the model
summary(model_baths)
## 
## Call:
## lm(formula = Price ~ Baths, data = Homes_CA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -374.93 -181.56   -2.74  152.31  614.81 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    90.71     148.57   0.611  0.54641   
## Baths         194.74      62.28   3.127  0.00409 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared:  0.2588, Adjusted R-squared:  0.2324 
## F-statistic: 9.779 on 1 and 28 DF,  p-value: 0.004092

Regression Results:

  • The slope estimate for bathrooms is 194.74, indicating that each additional bathroom is associated with an average increase of $194,740 in price.

  • The p-value for bathrooms is 0.00409, which is statistically significant (p < 0.01).

  • The \(R^2\) value is 0.2588, meaning that about 26% of the variability in prices is explained by the number of bathrooms.

The regression shows a significant, positive association between the number of bathrooms and home prices in California. Taken on its own, this suggests that buyers may value additional bathrooms more than additional bedrooms, possibly because bathrooms contribute to both functionality and luxury in a home. A simple regression cannot separate the effect of bathrooms from other features that tend to accompany them, such as overall size.
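The three simple regressions can be compared on a common scale. In a simple linear regression, \(R^2\) is the squared correlation between predictor and response, so the correlations can be recovered from the printed \(R^2\) values. This is a small sketch using only numbers from the outputs above.

```r
# Multiple R-squared values copied from the three simple regressions
r_squared <- c(Size = 0.3594, Beds = 0.04605, Baths = 0.2588)

# All three slopes are positive, so each correlation is the positive square root
r <- sqrt(r_squared)

round(r, 2)   # about 0.60 for Size, 0.21 for Beds, 0.51 for Baths
```

Size and Baths each show a moderate correlation with price, while Beds shows a weak one.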

4. To what extent do size, bedrooms, and bathrooms collectively determine home prices in California?

To evaluate the combined impact of size, number of bedrooms, and number of bathrooms on home prices, we fit a multiple regression model to the California data. This analysis helps identify which features independently predict prices after accounting for the others, offering a more nuanced understanding of how these characteristics interact to influence home values.

# Multiple regression: Price ~ Size + Beds + Baths
model_full <- lm(Price ~ Size + Beds + Baths, data = Homes_CA)

# Summary of the model
summary(model_full)
## 
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = Homes_CA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -415.47 -130.32   19.64  154.79  384.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -41.5608   210.3809  -0.198   0.8449  
## Size          0.2811     0.1189   2.364   0.0259 *
## Beds        -33.7036    67.9255  -0.496   0.6239  
## Baths        83.9844    76.7530   1.094   0.2839  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared:  0.3912, Adjusted R-squared:  0.3209 
## F-statistic: 5.568 on 3 and 26 DF,  p-value: 0.004353

Regression Results:

  • The model’s \(R^2\) value is 0.3912, indicating that approximately 39% of the variability in prices is explained by size, bedrooms, and bathrooms combined.

  • Size is the only significant predictor (p = 0.0259), with a slope estimate of 0.2811: holding bedrooms and bathrooms fixed, each additional 1,000 square feet is associated with a price about $281,100 higher.

  • Bedrooms (p = 0.6239) and bathrooms (p = 0.2839) are not statistically significant in this model.
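The printed summaries also allow a quick comparison of the full model with the size-only model. The sketch below recomputes adjusted \(R^2\) and the overall F-statistic from the reported \(R^2\) values, assuming n = 30 California homes (implied by 28 residual degrees of freedom in the simple models). The helper `adj_r2` is an illustrative name, not a base R function.

```r
n <- 30  # California homes: 28 residual df + 2 coefficients in the simple models

# Adjusted R-squared for a model with k predictors
adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj_size_only <- adj_r2(0.3594, n, k = 1)   # model 1: Price ~ Size
adj_full      <- adj_r2(0.3912, n, k = 3)   # model 4: Price ~ Size + Beds + Baths

# Overall F-statistic of the full model from its R-squared
f_full <- (0.3912 / 3) / ((1 - 0.3912) / (n - 3 - 1))

round(c(adj_size_only = adj_size_only, adj_full = adj_full, f_full = f_full), 3)
```

The adjusted \(R^2\) drops from about 0.337 with Size alone to about 0.321 with all three predictors, so adding Beds and Baths does not improve the fit once the extra parameters are penalized.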

The multiple regression includes all three predictors (size, beds, and baths) at once, so each coefficient measures a predictor’s contribution to Price after adjusting for the other two. This differs from the simple regressions in questions 1-3, where each predictor was analyzed in isolation. In isolation, size and number of bathrooms were both significant and number of bedrooms was not. In the multiple regression, size is the only significant predictor. This pattern is consistent with multicollinearity: in California homes, size and baths are likely correlated (e.g., larger homes tend to have more bathrooms). In the simple regression on bathrooms, the Baths slope absorbs any size-related variation in price that bathrooms happen to track, because no other predictor is in the model. In the multiple regression, variation shared by Size and Baths is not credited to either coefficient, which leaves Baths with a smaller unique contribution that is not statistically significant.

Furthermore, in multiple regression, each predictor’s p-value reflects its unique contribution to price after accounting for the other predictors, so significance depends on how much additional variance the predictor explains. In question 3, the number of bathrooms was statistically significant because it explained a substantial proportion of the price variance by itself. In the multiple regression model, its contribution overlaps substantially with size, and the Baths slope falls from 194.74 to 83.98. The overlap also makes the estimates less precise: the standard error of the Size slope rises from 0.08558 to 0.1189. Thus, while bathrooms are related to price, their effect is hard to separate from size, and size remains the stronger predictor when both are considered.

To conclude, size is the only significant predictor in the multiple regression model; bedrooms and bathrooms are not significant once size is accounted for. This is consistent with multicollinearity, where size already captures much of the information carried by bedrooms and bathrooms because larger homes typically have more of both. Among these three features, size is the dominant predictor of home prices in the California sample.
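The multicollinearity explanation can be checked directly with pairwise correlations and variance inflation factors (VIFs). The sketch below defines a hypothetical helper, `vif_manual`, that computes each VIF as 1 / (1 - R²) from a regression of one predictor on the others. To keep the example self-contained it runs on simulated stand-in data, not on the HomesForSale values, so the numbers it prints say nothing about the real homes.

```r
# VIF for each predictor: 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on the remaining predictors
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated stand-in data (NOT the HomesForSale values), used only so the
# sketch runs on its own; Beds and Baths are built to grow with Size
set.seed(1)
sim <- data.frame(Size = rnorm(30, mean = 1800, sd = 500))
sim$Baths <- round(1 + sim$Size / 900 + rnorm(30, sd = 0.4))
sim$Beds  <- round(1 + sim$Size / 700 + rnorm(30, sd = 0.6))

cor(sim)                                           # pairwise correlations
vifs <- vif_manual(sim, c("Size", "Beds", "Baths"))
vifs                                               # values well above 1 flag overlap
```

With the real data loaded, the equivalent calls would be `cor(Homes_CA[, c("Size", "Beds", "Baths")])` and `vif_manual(Homes_CA, c("Size", "Beds", "Baths"))`. A VIF of 1 means a predictor is uncorrelated with the others, and larger values mean its coefficient is estimated less precisely.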

5. Are there significant differences in home prices among the four states (CA, NY, NJ, PA)?

This analysis investigates whether home prices differ significantly across the four states included in the dataset: California, New York, New Jersey, and Pennsylvania. Using an ANOVA model, we test for mean differences in prices and perform post-hoc pairwise comparisons to identify specific state-level variations. This exploration provides a clearer picture of regional price disparities in the housing market.

# One-way ANOVA to test price differences among states
anova_model <- aov(Price ~ State, data = Homes)

# Summary of the ANOVA model
summary(anova_model)
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## State         3 1198169  399390   7.355 0.000148 ***
## Residuals   116 6299266   54304                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Post-hoc Tukey test to explore pairwise differences
TukeyHSD(anova_model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Price ~ State, data = Homes)
## 
## $State
##             diff       lwr        upr     p adj
## NJ-CA -206.83333 -363.6729  -49.99379 0.0044754
## NY-CA -170.03333 -326.8729  -13.19379 0.0280402
## PA-CA -269.80000 -426.6395 -112.96045 0.0001011
## NY-NJ   36.80000 -120.0395  193.63955 0.9282064
## PA-NJ  -62.96667 -219.8062   93.87288 0.7224830
## PA-NY  -99.76667 -256.6062   57.07288 0.3505951

ANOVA Results:

  • The p-value for the ANOVA is 0.000148, indicating significant differences in mean home prices among the four states.

  • The Tukey post-hoc test shows:

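    • California homes are significantly more expensive than those in New Jersey (mean difference about $207,000, adjusted p = 0.0045), New York (about $170,000, adjusted p = 0.0280), and Pennsylvania (about $270,000, adjusted p = 0.0001).

    • Differences between New Jersey, New York, and Pennsylvania are not statistically significant (adjusted p-values from 0.35 to 0.93).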

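The analysis indicates that location is significantly associated with home prices, with California having the highest average asking price in the sample. This may reflect high demand and limited supply in California’s housing market compared to the other states, although these data cannot test that explanation. No significant differences were found among New Jersey, New York, and Pennsylvania. This does not establish that prices in those states are the same, since the confidence intervals for those differences are wide, but any differences among them are smaller than the gap to California.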
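The ANOVA table can be verified by hand, and it also yields a simple effect size. The sketch below recomputes the F-statistic and p-value from the printed sums of squares and adds eta-squared (the share of total variation attributable to State), a supplementary measure that is not part of the output above.

```r
# Sums of squares and degrees of freedom copied from summary(anova_model)
ss_state <- 1198169
ss_resid <- 6299266
df_state <- 3
df_resid <- 116

# F = mean square between states / mean square within states
f_stat  <- (ss_state / df_state) / (ss_resid / df_resid)
p_value <- pf(f_stat, df_state, df_resid, lower.tail = FALSE)

# Eta-squared: proportion of total variation in Price associated with State
eta_sq <- ss_state / (ss_state + ss_resid)

round(c(F = f_stat, eta_squared = eta_sq), 3)   # F about 7.355, eta-squared about 0.16
p_value                                         # below 0.001
```

By this measure, state accounts for roughly 16% of the variation in asking price, so most of the variation lies between homes within the same state.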

Summary/Conclusion

This analysis explored various factors influencing home prices in California and regional price differences among four states (California, New York, New Jersey, and Pennsylvania). Key findings are summarized below:

Taken individually, home size and the number of bathrooms are both significant predictors of California home prices, while the number of bedrooms is not. When all three are considered together, size is the only significant predictor, and the three features jointly explain about 39% of the variation in price. Regionally, California stands out for its higher asking prices compared to New Jersey, New York, and Pennsylvania, which do not differ significantly from one another. These results rest on small samples of asking prices (30 homes per state), but they can still inform buyers and sellers about which features track home values in California and how prices compare across the four states.