Introduction

This report analyzes two data sets sourced from www.lock5stat.com (lock5): HomesForSale and HomesForSaleCA. The analysis uses simple linear regression, multiple linear regression, and one-way ANOVA. The questions posed for this analysis are as follows:

Questions
  1. Evaluating for California only, how much does the size of a home influence its price?
  2. Evaluating for California only, how does the number of bedrooms of a home influence its price?
  3. Evaluating for California only, how does the number of bathrooms of a home influence its price?
  4. Evaluating for California only, how do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?
  5. Are there significant differences in home prices among the four states (CA, NY, NJ, PA)?

Data

HomesForSale is a data set provided by lock5 and contains data from www.zillow.com (zillow) collected in 2019. The set contains 120 observations across five variables: State, Price, Size, Beds, and Baths. HomesForSaleCA is a subset of this data which contains 30 observations across the same five variables. The descriptions of each variable are as follows:

Analysis

This section presents each question posed, the fitted model, a plot of the modeled data, and an interpretation of the result.

Question 1

Evaluating for California only, how much does the size of a home influence its price?
Regression Results
Dependent variable:
Home Price
Home Size (sqft) 0.339 (0.086)
Constant -56.817 (154.681)
Observations 30
R2 0.359
Adjusted R2 0.337
Residual Std. Error 219.257 (df = 28)
F Statistic 15.709 (df = 1; 28)
Note: P-value for slope: 0.000463

Based on the regression analysis, home size has a statistically significant positive association with price. Each additional square foot is associated with an increase of approximately $339 in predicted price (a slope of 0.339, with Price recorded in thousands of dollars). The model explains around 36% of the variability in home prices, suggesting that other factors also contribute to pricing. The slope's t-value of 3.963 and p-value of 4.63e-4 indicate that the relationship is statistically significant.
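To make the slope interpretation concrete, the fitted line can be evaluated by hand. The following is a minimal sketch that uses only the coefficients reported in the table above. It assumes Price is recorded in thousands of dollars and Size in square feet; the two example sizes and the helper name predict_price are hypothetical and chosen purely for illustration.

```r
# Coefficients copied from the Question 1 regression table
# (assumption: Price is in $1,000s and Size is in square feet)
intercept <- -56.817
slope <- 0.339

# Hypothetical helper: predicted price (in $1,000s) for a given size
predict_price <- function(size_sqft) intercept + slope * size_sqft

# Two illustrative sizes, 1,000 sq ft apart
sizes <- c(1500, 2500)
predicted <- predict_price(sizes)

# Change in predicted price for an extra 1,000 sq ft (in $1,000s)
price_gap <- diff(predicted)

print(data.frame(size_sqft = sizes, predicted_price_thousands = predicted))
print(price_gap)
```

The gap between the two predictions equals 1,000 times the slope, which is the same statement as "about $339 per additional square foot". Predictions of this kind are only meaningful for sizes within the range observed in the data.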

Question 2

Evaluating for California only, how does the number of bedrooms of a home influence its price?
Regression: Price ~ Bedrooms
Dependent variable:
Home Price
Number of Bedrooms 84.767 (72.911)
Constant 269.762 (233.618)
Observations 30
R2 0.046
Adjusted R2 0.012
Residual Std. Error 267.560 (df = 28)
F Statistic 1.352 (df = 1; 28)
Note: P-value for slope: 0.254798

In this model, each additional bedroom is associated with a predicted increase of about $84,767 in price (a slope of 84.767, with Price recorded in thousands of dollars), but the effect is not statistically significant (p = 0.255). The model explains only 4.6% of the variation in price, suggesting that bedroom count alone is not a strong predictor of home price in this data set.
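Another way to see why the bedroom slope is not significant is to build an approximate 95% confidence interval from the reported estimate and standard error. This is a sketch that works from the rounded table values rather than the fitted model object, so the result may differ slightly from what confint() would return on the original model.

```r
# Values copied from the Question 2 regression table
estimate <- 84.767    # slope for Number of Bedrooms
std_error <- 72.911   # standard error of the slope
df_resid <- 28        # residual degrees of freedom

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df_resid)

# Confidence interval: estimate +/- t_crit * standard error
ci <- estimate + c(-1, 1) * t_crit * std_error
names(ci) <- c("lower", "upper")

print(round(ci, 1))
```

Because the interval spans zero, the data are compatible with no linear relationship between bedroom count and price, which agrees with the reported p-value of 0.255.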

Question 3

Evaluating for California only, how does the number of bathrooms of a home influence its price?
Regression: Price ~ Bathrooms
Dependent variable:
Home Price
Number of Bathrooms 194.739 (62.275)
Constant 90.712 (148.571)
Observations 30
R2 0.259
Adjusted R2 0.232
Residual Std. Error 235.838 (df = 28)
F Statistic 9.779 (df = 1; 28)
Note: P-value for slope: 0.004092

The linear regression of price on the number of bathrooms shows that each additional bathroom is associated with an increase of approximately $194,739 in predicted home price (a slope of 194.739, with Price recorded in thousands of dollars). The relationship is statistically significant (p = 0.0041), and the model explains about 26% of the variance in home prices. This suggests that the number of bathrooms, considered on its own, is a useful predictor of home value.
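The reported p-value can be cross-checked from the table itself, since the t-statistic for a slope is the estimate divided by its standard error. The sketch below assumes the usual two-sided t-test with the residual degrees of freedom shown in the table; it uses rounded inputs, so small differences from the reported values are expected.

```r
# Values copied from the Question 3 regression table
estimate <- 194.739   # slope for Number of Bathrooms
std_error <- 62.275   # standard error of the slope
df_resid <- 28        # residual degrees of freedom

# t-statistic and two-sided p-value for H0: slope = 0
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df_resid)

# In simple regression, the squared t-statistic equals the F statistic
f_from_t <- t_value^2

print(c(t = t_value, p = p_value, F = f_from_t))
```

The squared t-statistic should land close to the F statistic of 9.779 in the table, which is a quick consistency check on the reported output.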

Question 4

Evaluating for California only, how do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?
Multiple Regression: Price ~ Size + Bedrooms + Bathrooms
Dependent variable:
Home Price
Size 0.281 (0.119)
Number of Bedrooms -33.704 (67.926)
Number of Bathrooms 83.984 (76.753)
Constant -41.561 (210.381)
Observations 30
R2 0.391
Adjusted R2 0.321
Residual Std. Error 221.820 (df = 26)
F Statistic 5.568 (df = 3; 26)
Note: P-values — Size: 0.025857 , Bedrooms: 0.623933 , Bathrooms: 0.283894

A multiple regression was conducted to determine how size, number of bedrooms, and number of bathrooms jointly influence home price. The model explained 39.1% of the variance in home prices (Adjusted R² = 32.1%), and was statistically significant overall (F(3,26) = 5.568, p < 0.01).

  • Size had a statistically significant positive effect (p = 0.026), suggesting that larger homes tend to be more expensive.

  • Bedrooms did not show a significant effect (p = 0.624), indicating no clear price impact when controlling for size and bathrooms.

  • Bathrooms also did not reach statistical significance (p = 0.284), though the positive coefficient suggests a potential trend.
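The adjusted R² and the overall significance quoted above can be recomputed from the summary statistics in the table. This is a sketch based on the rounded values of R² and F, so it only approximately reproduces the reported figures.

```r
# Values copied from the Question 4 regression table
n <- 30          # observations
k <- 3           # predictors: Size, Beds, Baths
r2 <- 0.391      # R-squared
f_stat <- 5.568  # overall F statistic

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Upper-tail p-value of the overall F test on (k, n - k - 1) df
p_overall <- pf(f_stat, k, n - k - 1, lower.tail = FALSE)

print(c(adjusted_r2 = adj_r2, p_overall = p_overall))
```

The drop from R² to adjusted R² reflects that two of the three predictors add little once size is in the model.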

Question 5

Are there significant differences in home prices among the four states (CA, NY, NJ, PA)?
ANOVA Table: Differences in Home Prices by State
            Df   Sum Sq    Mean Sq     F value    Pr(>F)
State        3   1198169   399389.58   7.354696   0.000148
Residuals  116   6299266    54304.02

There are statistically significant differences in average home prices among the four states (F(3,116) = 7.35, p < 0.001). As can be seen in the box plot, the median home price in California is close to $600 thousand, while the other states have medians between $200 thousand and $300 thousand.
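The F statistic in the ANOVA table follows directly from the sums of squares and degrees of freedom. The sketch below recomputes it from the table values. The last line adds eta-squared (the share of total variation attributable to State), which is not part of the original output and is included here only as an illustrative effect size.

```r
# Values copied from the ANOVA table
ss_state <- 1198169; df_state <- 3
ss_resid <- 6299266; df_resid <- 116

# Mean squares are sums of squares divided by their degrees of freedom
ms_state <- ss_state / df_state
ms_resid <- ss_resid / df_resid

# F statistic and its upper-tail p-value
f_value <- ms_state / ms_resid
p_value <- pf(f_value, df_state, df_resid, lower.tail = FALSE)

# Illustrative effect size (not in the original table)
eta_sq <- ss_state / (ss_state + ss_resid)

print(c(F = f_value, p = p_value, eta_squared = eta_sq))
```

A significant F test indicates only that at least one state mean differs from the others. It does not by itself identify which pairs of states differ.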

Conclusion

The results show a clear and statistically significant positive relationship between home size and price in California. The number of bathrooms is also a significant predictor on its own (p = 0.004) but not after controlling for size, while the number of bedrooms is not significant in either the simple or the multiple regression. The ANOVA shows that average home prices differ significantly among California, New Jersey, New York, and Pennsylvania, and the box plot suggests that homes in California are considerably more expensive than those in the other three states.

Appendix

Code snippets used in the analysis of each question:

```r Code Snippets
# Assumes ca_data (HomesForSaleCA) and home_data (HomesForSale) are already loaded.
library(stargazer)  # regression tables for Questions 1-4

# ---- Question 1: Price ~ Size ----

model <- lm(Price ~ Size, data = ca_data)

pval <- summary(model)$coefficients[2,4]

stargazer(model, type = "html", title = "Regression Results", 
          dep.var.labels = "Home Price", covariate.labels = "Home Size (sqft)",
          digits = 3, out = "model_output.html", star.cutoffs = NA,
          notes = paste("P-value for slope:", round(pval, 6)), 
          notes.append = FALSE)

plot(ca_data$Size, ca_data$Price, main = "Price vs Size",
     xlab = "Size (sq ft)", ylab = "Price", pch = 19, col = "blue")
abline(model, col = "red", lwd = 2)

# ---- Question 2: Price ~ Beds ----

bedroom_model <- lm(Price ~ Beds, data = ca_data)

pval <- summary(bedroom_model)$coefficients[2,4]

stargazer(bedroom_model, type = "html", title = "Regression: Price ~ Bedrooms",
          dep.var.labels = "Home Price", covariate.labels = "Number of Bedrooms",
          digits = 3, single.row = TRUE, star.cutoffs = NA,
          notes = paste("P-value for slope:", round(pval, 6)), 
          notes.append = FALSE)

plot(ca_data$Beds, ca_data$Price,
     main = "Price vs Number of Bedrooms",
     xlab = "Number of Bedrooms", ylab = "Price",
     pch = 19, col = "darkblue")
abline(bedroom_model, col = "red", lwd = 2)

# ---- Question 3: Price ~ Baths ----

bathroom_model <- lm(Price ~ Baths, data = ca_data)

pval <- summary(bathroom_model)$coefficients[2,4]

stargazer(bathroom_model, type = "html", title = "Regression: Price ~ Bathrooms",
          dep.var.labels = "Home Price", covariate.labels = "Number of Bathrooms",
          digits = 3, single.row = TRUE, star.cutoffs = NA,
          notes = paste("P-value for slope:", round(pval, 6)), 
          notes.append = FALSE)

plot(ca_data$Baths, ca_data$Price,  # x-axis must be Baths to match bathroom_model
     main = "Price vs Number of Bathrooms",
     xlab = "Number of Bathrooms", ylab = "Price",
     pch = 19, col = "darkblue")
abline(bathroom_model, col = "red", lwd = 2)

# ---- Question 4: Price ~ Size + Beds + Baths ----

multi_model <- lm(Price ~ Size + Beds + Baths, data = ca_data)

pvals <- summary(multi_model)$coefficients[, 4]
p_note <- paste("P-values — Size: ", round(pvals["Size"], 6),
                 ", Bedrooms: ", round(pvals["Beds"], 6),
                 ", Bathrooms: ", round(pvals["Baths"], 6))

stargazer(multi_model,
          type = "html",  # Use "text" for console or "latex" for PDF output
          title = "Multiple Regression: Price ~ Size + Bedrooms + Bathrooms",
          dep.var.labels = "Home Price",
          covariate.labels = c("Size", "Number of Bedrooms", "Number of Bathrooms"),
          digits = 3,
          single.row = TRUE,
          star.cutoffs = NA,       # disables asterisks
          notes = p_note,
          notes.append = FALSE)

plot(multi_model, which = 1,
     main = "Residuals vs Fitted",
     col = "darkblue", pch = 19)

# ---- Question 5: one-way ANOVA of Price by State ----

library(knitr)
anova_model <- aov(Price ~ State, data = home_data)

kable(summary(anova_model)[[1]], 
      caption = "ANOVA Table: Differences in Home Prices by State",
      digits = 6)

boxplot(Price ~ State, data = home_data,
        main = "Home Prices by State",
        xlab = "State", ylab = "Price",
        col = "lightblue", border = "gray40")

# ---- End of snippets ----
```