The “HomesForSale” dataset, sourced from Lock5Stat, offers valuable insights into the housing market across four states in the U.S.: California (CA), New York (NY), New Jersey (NJ), and Pennsylvania (PA). This dataset provides a platform to explore relationships between various home features and their influence on housing prices, as well as regional differences in market values.
This analysis focuses on the following questions to understand key housing market dynamics:
1. Influence of Home Size on Price in California: How does the size (square footage) of a home impact its price in California?
2. Impact of Bedrooms on Price in California: Does the number of bedrooms in a California home significantly influence its price?
3. Impact of Bathrooms on Price in California: How does the number of bathrooms affect home prices in California?
4. Joint Influence of Size, Bedrooms, and Bathrooms on Price in California: To what extent do these three factors collectively determine home prices in California?
5. Regional Differences in Home Prices: Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This question investigates whether a home’s state significantly influences its price.
Using regression and ANOVA methods, this study interprets the impact of individual and multiple variables on home prices. Regression models are applied to questions 1-4, focusing on slope estimates and p-values to evaluate statistical significance. For question 5, a one-way ANOVA determines whether differences in mean prices across states are statistically significant. This investigation provides a data-driven understanding of housing market trends and highlights factors associated with home prices across regions.
The data used to analyze these questions come from the “HomesForSale” dataset, sourced from Lock5Stat. The dataset includes samples of homes for sale in each state, selected from zillow.com in 2019. It is a data frame with 120 observations on the following 5 variables:
State: Location of the home (CA, NJ, NY, or PA)
Price: Asking price (in $1,000’s)
Size: Area of all rooms (documented in 1,000’s sq. ft.; the regression estimates reported here are consistent with values recorded in sq. ft.)
Beds: Number of bedrooms
Baths: Number of bathrooms
The data is shown in the data table below.
This question focuses on understanding the relationship between a home’s size and its price in California. By isolating data for California homes, we use a simple linear regression to examine whether the square footage significantly influences home prices. This analysis provides insight into how much home buyers value additional space in the California housing market.
# Filter data for California
Homes_CA <- subset(Homes, State == "CA")
# Check for missing values and remove them
Homes_CA <- na.omit(Homes_CA)
# Linear regression: Price vs Size
model_size <- lm(Price ~ Size, data = Homes_CA)
# Summary of the model
summary(model_size)
##
## Call:
## lm(formula = Price ~ Size, data = Homes_CA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462.55 -139.69 39.24 147.65 352.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -56.81675 154.68102 -0.367 0.716145
## Size 0.33919 0.08558 3.963 0.000463 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared: 0.3594, Adjusted R-squared: 0.3365
## F-statistic: 15.71 on 1 and 28 DF, p-value: 0.0004634
Regression Results:
The slope estimate for size is 0.33919. With Price in $1,000s and a coefficient of this magnitude, Size appears to be recorded in square feet, so each additional square foot is associated with about $339 in price, or approximately $339,190 per additional 1,000 square feet.
The p-value for size is 0.000463 (p < 0.001), providing strong evidence of a relationship between home size and price.
The \(R^2\) value is 0.3594, meaning that about 36% of the variability in home prices is explained by size alone.
The regression results show that the size of a home is significantly associated with its price, with larger homes commanding higher prices in California. However, the \(R^2\) value indicates that about 64% of the variability in prices is left unexplained, so factors not included in the model also play a substantial role.
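As a worked illustration of the fitted line, the reported coefficients can be used to compute a predicted asking price. This is only a sketch: it hard-codes the estimates from the output above rather than refitting the model, the helper name `predict_price_ca` and the example sizes are assumptions made for illustration, and it takes Size to be in square feet and Price in $1,000s.

```r
# Coefficients copied from the summary(model_size) output above
intercept <- -56.81675   # in $1,000s
slope     <- 0.33919     # $1,000s per unit of Size (assumed: square feet)

# Hypothetical helper: fitted asking price (in $1,000s) for a given size
predict_price_ca <- function(size_sqft) {
  intercept + slope * size_sqft
}

pred_2000 <- predict_price_ca(2000)
pred_3000 <- predict_price_ca(3000)

# The gap between the two predictions is the slope times 1,000 sq. ft.
diff_per_1000 <- pred_3000 - pred_2000

print(round(c(pred_2000 = pred_2000, diff_per_1000 = diff_per_1000), 2))
```

Under these assumptions the fitted price for a 2,000 square foot home is about $621,600. With the fitted object available, `predict(model_size, newdata = data.frame(Size = 2000))` would give the same point estimate.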
To assess the impact of bedrooms on home prices, we analyze the subset of California data using a linear regression model. This analysis examines whether the number of bedrooms significantly contributes to home prices and helps determine if this feature is an important consideration for buyers in the California housing market.
# Linear regression: Price vs Beds
model_beds <- lm(Price ~ Beds, data = Homes_CA)
# Summary of the model
summary(model_beds)
##
## Call:
## lm(formula = Price ~ Beds, data = Homes_CA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -413.83 -236.62 29.94 197.69 570.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 269.76 233.62 1.155 0.258
## Beds 84.77 72.91 1.163 0.255
##
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared: 0.04605, Adjusted R-squared: 0.01198
## F-statistic: 1.352 on 1 and 28 DF, p-value: 0.2548
Regression Results:
The slope estimate for bedrooms is 84.77, indicating that each additional bedroom is associated with an average increase of $84,770 in price.
The p-value for bedrooms is 0.255, which is not significant (p > 0.05).
The \(R^2\) value is 0.04605, showing that only about 4.6% of the variability in prices is explained by the number of bedrooms.
The regression results show that the number of bedrooms does not have a statistically significant association with home prices in California. This result may indicate that buyers place more value on other features, such as home size or location, than on the number of bedrooms.
We investigate the relationship between the number of bathrooms and home prices in California using a simple linear regression model. This analysis explores whether additional bathrooms add significant value to a home and offers insights into their importance in the California real estate market.
# Linear regression: Price vs Baths
model_baths <- lm(Price ~ Baths, data = Homes_CA)
# Summary of the model
summary(model_baths)
##
## Call:
## lm(formula = Price ~ Baths, data = Homes_CA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -374.93 -181.56 -2.74 152.31 614.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.71 148.57 0.611 0.54641
## Baths 194.74 62.28 3.127 0.00409 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared: 0.2588, Adjusted R-squared: 0.2324
## F-statistic: 9.779 on 1 and 28 DF, p-value: 0.004092
Regression Results:
The slope estimate for bathrooms is 194.74, indicating that each additional bathroom is associated with an average increase of $194,740 in price.
The p-value for bathrooms is 0.00409, which is statistically significant (p < 0.01).
The \(R^2\) value is 0.2588, meaning that about 26% of the variability in prices is explained by the number of bathrooms.
The regression results show that the number of bathrooms has a significant and positive association with home prices in California. Taken together with the bedrooms result, this suggests that bathrooms track price more closely than bedrooms do, possibly because bathrooms contribute to both functionality and luxury in a home.
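One way to probe the bedrooms-versus-bathrooms comparison is to put 95% confidence intervals around both slopes. The sketch below rebuilds them from the reported estimates and standard errors (28 residual degrees of freedom in each model); the helper `slope_ci` is a hypothetical name and not part of the original analysis.

```r
# Hypothetical helper: t-based confidence interval for a regression slope
slope_ci <- function(estimate, std_error, df, level = 0.95) {
  t_crit <- qt(1 - (1 - level) / 2, df)
  c(lower = estimate - t_crit * std_error,
    upper = estimate + t_crit * std_error)
}

# Estimates and standard errors copied from the two summaries above
ci_beds  <- slope_ci(84.77, 72.91, df = 28)
ci_baths <- slope_ci(194.74, 62.28, df = 28)

# Intervals are in $1,000s per additional bedroom or bathroom
print(round(rbind(Beds = ci_beds, Baths = ci_baths), 1))
```

The Beds interval spans zero, and the two intervals overlap, so the simple regressions do not by themselves establish that the bathroom slope is larger than the bedroom slope. With the fitted objects available, `confint(model_beds)` and `confint(model_baths)` return the same kind of interval.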
To evaluate the combined impact of size, number of bedrooms, and number of bathrooms on home prices, we fit a multiple regression model to the California data. This analysis helps identify which features independently predict prices after accounting for the others, offering a more nuanced understanding of how these characteristics interact to influence home values.
# Multiple regression: Price ~ Size + Beds + Baths
model_full <- lm(Price ~ Size + Beds + Baths, data = Homes_CA)
# Summary of the model
summary(model_full)
##
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = Homes_CA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -415.47 -130.32 19.64 154.79 384.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41.5608 210.3809 -0.198 0.8449
## Size 0.2811 0.1189 2.364 0.0259 *
## Beds -33.7036 67.9255 -0.496 0.6239
## Baths 83.9844 76.7530 1.094 0.2839
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared: 0.3912, Adjusted R-squared: 0.3209
## F-statistic: 5.568 on 3 and 26 DF, p-value: 0.004353
Regression Results:
The model’s \(R^2\) value is 0.3912, indicating that approximately 39% of the variability in prices is explained by size, bedrooms, and bathrooms combined.
Size is the only significant predictor (p = 0.0259), with a slope estimate of 0.2811, suggesting that each additional 1,000 square feet is associated with about $281,100 in additional price, holding bedrooms and bathrooms constant.
Bedrooms (p = 0.6239) and bathrooms (p = 0.2839) are not statistically significant in this model.
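Because \(R^2\) cannot decrease when predictors are added, adjusted \(R^2\) is a fairer basis for comparing the size-only model with the three-predictor model. The sketch below recomputes it from the reported \(R^2\) values, taking n = 30 California homes (28 residual degrees of freedom plus two coefficients in the simple model); `adj_r2` is a hypothetical helper name.

```r
# Hypothetical helper: adjusted R^2 for a model with p predictors and n rows
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 30  # California homes, inferred from the residual degrees of freedom

adj_size <- adj_r2(0.3594, n, p = 1)  # Price ~ Size
adj_full <- adj_r2(0.3912, n, p = 3)  # Price ~ Size + Beds + Baths

print(round(c(size_only = adj_size, full = adj_full), 4))
```

These values agree with the reported 0.3365 and 0.3209 up to rounding. Adjusted \(R^2\) is lower for the full model, which indicates that adding Beds and Baths does not improve the fit enough to offset the two extra parameters.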
The multiple regression model includes all predictors simultaneously (Size, Beds, and Baths) to determine their joint and independent contributions to predicting Price. This differs from the simple regressions in questions 1-3, where each predictor is analyzed in isolation. In isolation, size and number of bathrooms were both significant, while number of bedrooms was not. In the multiple regression, size is the only significant predictor. This pattern is consistent with multicollinearity: in California homes, size and baths are likely correlated (e.g., larger homes tend to have more bathrooms). In a simple regression on bathrooms alone, the Baths coefficient absorbs the size-related variation in price because no other predictor is in the model. In the multiple regression, each coefficient reflects only the variation in price that its predictor explains beyond the others, so the variation shared by Size and Baths is not credited uniquely to either. This leaves Baths with a smaller, non-significant unique contribution (its estimate falls from 194.74 to 83.98).
Furthermore, in multiple regression, each predictor’s p-value reflects its unique contribution to price after accounting for the other predictors. The significance of a predictor depends on how much additional variance it explains. In question 3, the number of bathrooms was statistically significant because it explained a substantial proportion of the price variance by itself. In the multiple regression model, however, its contribution overlaps substantially with size, reducing its independent significance. Thus, while bathrooms are associated with price, their effect overlaps with size, making size the dominant predictor when both are considered.
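The idea of additional variance explained can be made concrete with a partial F-test comparing the size-only model with the full model. The following is a sketch computed from the rounded \(R^2\) values reported above, so the result is approximate; it is not output from the original analysis.

```r
# R^2 values copied from the two model summaries
r2_reduced <- 0.3594  # Price ~ Size
r2_full    <- 0.3912  # Price ~ Size + Beds + Baths

q        <- 2   # predictors added to the reduced model (Beds, Baths)
df_resid <- 26  # residual degrees of freedom of the full model

# Partial F statistic for the added predictors
f_stat  <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_resid)
p_value <- pf(f_stat, q, df_resid, lower.tail = FALSE)

print(round(c(F = f_stat, p = p_value), 3))
```

The F statistic comes out below 1 (about 0.68, with a p-value of roughly 0.52), so Beds and Baths together add no detectable explanatory power once Size is in the model. With the fitted objects available, `anova(model_size, model_full)` performs the same comparison on the unrounded fits.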
To conclude, size is the only significant predictor in the multiple regression model; bedrooms and bathrooms lose significance when analyzed together with size. This is consistent with multicollinearity, where size already captures much of the effect of larger homes typically having more bedrooms and bathrooms. Thus, the model points to size as the dominant factor in predicting home prices in California.
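Multicollinearity can be checked directly with the correlations between predictors and with variance inflation factors (VIFs). Since the raw data are not reproduced here, the sketch below uses simulated data, not the HomesForSale values: the column names mirror the dataset, but the numbers, the seed, the data-generating assumptions, and the helper `vif_manual` are all invented for illustration.

```r
set.seed(1)
n <- 30

# Simulated predictors (assumption): bedrooms and bathrooms rise with size
size  <- runif(n, min = 800, max = 3500)
baths <- 1 + size / 1000 + rnorm(n, sd = 0.5)
beds  <- 2 + size / 1500 + rnorm(n, sd = 0.7)
sim   <- data.frame(Size = size, Beds = beds, Baths = baths)

# Hypothetical helper: VIF = 1 / (1 - R^2), where R^2 comes from
# regressing the target predictor on all the other predictors
vif_manual <- function(data, target) {
  others <- setdiff(names(data), target)
  fit <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vifs <- sapply(names(sim), function(v) vif_manual(sim, v))

print(round(cor(sim), 2))  # pairwise correlations among predictors
print(round(vifs, 2))      # values well above 1 indicate shared variation
```

Applied to the real data, the same helper could be called as `vif_manual(Homes_CA[, c("Size", "Beds", "Baths")], "Size")`. The `car` package's `vif(model_full)` is a commonly used packaged alternative.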
This analysis investigates whether home prices differ significantly across the four states included in the dataset: California, New York, New Jersey, and Pennsylvania. Using an ANOVA model, we test for mean differences in prices and perform post-hoc pairwise comparisons to identify specific state-level variations. This exploration provides a clearer picture of regional price disparities in the housing market.
# One-way ANOVA to test price differences among states
anova_model <- aov(Price ~ State, data = Homes)
# Summary of the ANOVA model
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## State 3 1198169 399390 7.355 0.000148 ***
## Residuals 116 6299266 54304
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Tukey HSD post-hoc pairwise comparisons
TukeyHSD(anova_model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Price ~ State, data = Homes)
##
## $State
## diff lwr upr p adj
## NJ-CA -206.83333 -363.6729 -49.99379 0.0044754
## NY-CA -170.03333 -326.8729 -13.19379 0.0280402
## PA-CA -269.80000 -426.6395 -112.96045 0.0001011
## NY-NJ 36.80000 -120.0395 193.63955 0.9282064
## PA-NJ -62.96667 -219.8062 93.87288 0.7224830
## PA-NY -99.76667 -256.6062 57.07288 0.3505951
ANOVA Results:
The p-value for the ANOVA is 0.000148, indicating significant differences in mean home prices among the four states.
The Tukey post-hoc test shows:
California homes are significantly more expensive than those in New Jersey, New York, and Pennsylvania.
Differences between New Jersey, New York, and Pennsylvania are not statistically significant.
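All six Tukey intervals above have the same width, which is what a balanced design produces. As a sketch, the half-width can be rebuilt from the residual mean square in the ANOVA table, assuming 30 homes per state (120 observations over four states; the equal split is inferred from the equal-width intervals and the 30 California homes, not stated directly in the output).

```r
# Values copied from the ANOVA table above
mse      <- 54304  # residual mean square
df_resid <- 116    # residual degrees of freedom
k        <- 4      # number of states

n_per_group <- 30  # assumed equal group size

# Tukey HSD half-width for equal group sizes: q * sqrt(MSE / n)
q_crit     <- qtukey(0.95, nmeans = k, df = df_resid)
half_width <- q_crit * sqrt(mse / n_per_group)

# Half-width implied by the reported NJ-CA interval
reported_half_width <- (-49.99379 - (-363.6729)) / 2

print(round(c(computed = half_width, reported = reported_half_width), 2))
```

A pairwise difference is significant at the 5% family-wise level when its absolute value exceeds this half-width of roughly 157 (in $1,000s), which is the case only for the three comparisons involving California.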
The analysis indicates that location is significantly associated with home prices, with California having the highest average prices in this sample. This result may reflect high demand and limited supply in California’s housing market compared to the other states, although the data here cannot test that explanation. The lack of significant differences among New Jersey, New York, and Pennsylvania means the sample provides no evidence of price differences among those states; it does not by itself establish that their prices are uniform.
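The ANOVA table also allows a quick effect-size calculation. The sketch below recomputes the F statistic from the reported sums of squares and derives eta-squared, the share of total price variation attributable to State; eta-squared is added here for illustration and was not part of the original output.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table
ss_state <- 1198169; df_state <- 3
ss_resid <- 6299266; df_resid <- 116

# Mean squares and F statistic
ms_state <- ss_state / df_state
ms_resid <- ss_resid / df_resid
f_stat   <- ms_state / ms_resid
p_value  <- pf(f_stat, df_state, df_resid, lower.tail = FALSE)

# Eta-squared: between-state variation as a share of total variation
eta_sq <- ss_state / (ss_state + ss_resid)

print(round(c(F = f_stat, eta_sq = eta_sq), 4))
print(signif(p_value, 3))
```

By this calculation State accounts for roughly 16% of the variation in asking prices, so most of the variation in this sample lies within states rather than between them.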
This analysis explored various factors influencing home prices in California and regional price differences among four states (California, New York, New Jersey, and Pennsylvania). Key findings are summarized below:
Impact of Home Size on Price in California: Home size is a significant predictor of price in California, with larger homes commanding higher prices. The regression model showed that each additional 1,000 square feet is associated with approximately $339,190 in additional price, with size explaining 36% of price variability.
Influence of Bedrooms on Price in California: The number of bedrooms does not have a statistically significant effect on home prices. Buyers may prioritize other features, such as home size or location, over the number of bedrooms.
Effect of Bathrooms on Price in California: Bathrooms have a significant positive association with home prices, with each additional bathroom associated with an increase of about $194,740 in the simple regression. Bathrooms are more strongly associated with home prices than bedrooms, which may reflect their contribution to a home’s functionality and luxury.
Joint Influence of Size, Bedrooms, and Bathrooms: When considered together in a multiple regression model, size is the only significant predictor of price. The significance of bathrooms and bedrooms diminishes, suggesting multicollinearity—larger homes often have more bathrooms and bedrooms, with size capturing much of the shared effect. This underscores the dominant role of size in determining home prices.
Regional Price Differences Among States: ANOVA results indicate significant differences in home prices across states. California homes are significantly more expensive than those in New York, New Jersey, and Pennsylvania. However, prices among New York, New Jersey, and Pennsylvania do not differ significantly. California’s high demand and limited housing supply likely drive its higher prices.
The analysis reveals that home size and the number of bathrooms are the most influential factors in determining California home prices, with size being the dominant factor when considered alongside other variables. Regionally, California’s housing market stands out for its higher prices compared to the other states analyzed, reflecting its unique market dynamics. These insights can inform both buyers and sellers about key drivers of home values in California and the broader regional market.