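This report analyzes two data sets sourced from www.lock5stat.com (lock5): HomesForSale and HomesForSaleCA. The analysis uses simple linear regression, multiple linear regression, and one-way analysis of variance (ANOVA). It addresses five questions: whether home price in California depends on (1) size, (2) number of bedrooms, (3) number of bathrooms, and (4) all three variables jointly, and (5) whether average home prices differ among the four states in the data.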
HomesForSale is a data set provided by lock5 and contains data from www.zillow.com (zillow) collected in 2019. The set contains 120 observations across five variables: State, Price, Size, Beds, and Baths. HomesForSaleCA is the California subset of this data, with 30 observations across the same five variables. State is the state in which the home is located, Price is the price in thousands of dollars, Size is the size of the home in square feet, Beds is the number of bedrooms, and Baths is the number of bathrooms.
This section presents, for each question posed, the fitted regression, a plot of the modeled data, and an analysis of the result.
| | Dependent variable: Home Price |
|---|---|
| Home Size (sqft) | 0.339 (0.086) |
| Constant | -56.817 (154.681) |
| Observations | 30 |
| R² | 0.359 |
| Adjusted R² | 0.337 |
| Residual Std. Error | 219.257 (df = 28) |
| F Statistic | 15.709 (df = 1; 28) |
| Note: | P-value for slope: 0.000463 |
Based on the regression analysis, the size of a home has a statistically significant effect on its price. Because Price is recorded in thousands of dollars, the slope of 0.339 means that each additional square foot is associated with an increase of approximately $339 in predicted price. The model explains about 36% of the variability in home prices, suggesting that other factors also contribute to pricing. A t-value of 3.963 and a p-value of 4.63e-4 indicate that the relationship is statistically significant.
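As a rough illustration of how the fitted line can be used, the sketch below plugs the reported coefficients into the regression equation and recomputes the two-sided p-value from the reported t-value. The home size of 2,000 square feet is a hypothetical value chosen only for this example, and the coefficients are the rounded values from the table rather than the full-precision model.

```r
# Coefficients as reported in the regression table
# (Price in thousands of dollars, Size in square feet)
intercept <- -56.817
slope <- 0.339

# Hypothetical home size, chosen only for illustration
size_sqft <- 2000
predicted_price <- intercept + slope * size_sqft  # thousands of dollars

# Two-sided p-value for the slope from the reported t-value (df = n - 2 = 28)
t_value <- 3.963
p_value <- 2 * pt(-abs(t_value), df = 28)

cat("Predicted price ($1000s):", predicted_price, "\n")
cat("P-value for slope:", signif(p_value, 3), "\n")
```

With the full model object available, `predict(model, newdata = data.frame(Size = 2000))` would give the same kind of estimate without the rounding.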
| | Dependent variable: Home Price |
|---|---|
| Number of Bedrooms | 84.767 (72.911) |
| Constant | 269.762 (233.618) |
| Observations | 30 |
| R² | 0.046 |
| Adjusted R² | 0.012 |
| Residual Std. Error | 267.560 (df = 28) |
| F Statistic | 1.352 (df = 1; 28) |
| Note: | P-value for slope: 0.254798 |
In this case, each additional bedroom was associated with a predicted increase of about $84,800 in price (a coefficient of 84.767, with Price in thousands of dollars), but the effect was not statistically significant (p = 0.255). Additionally, the model explained only 4.6% of the variation in price, suggesting that bedroom count is not a strong predictor of home price in this data set.
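One way to see why the bedroom slope is not significant is to build an approximate 95% confidence interval from the reported coefficient and standard error. This is a minimal sketch using the rounded table values; the variable names are illustrative only.

```r
# Reported slope and standard error for Beds (Price in thousands of dollars)
slope <- 84.767
se <- 72.911
df <- 28  # n - 2 for a simple regression with 30 observations

# 95% confidence interval: estimate +/- critical t * standard error
t_crit <- qt(0.975, df)
ci <- slope + c(-1, 1) * t_crit * se

cat("95% CI for the Beds slope ($1000s):", round(ci, 1), "\n")
cat("Interval contains zero:", ci[1] < 0 && ci[2] > 0, "\n")
```

Because the interval spans zero, the data are consistent with no price change per additional bedroom, which agrees with the p-value of 0.255.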
| | Dependent variable: Home Price |
|---|---|
| Number of Bathrooms | 194.739 (62.275) |
| Constant | 90.712 (148.571) |
| Observations | 30 |
| R² | 0.259 |
| Adjusted R² | 0.232 |
| Residual Std. Error | 235.838 (df = 28) |
| F Statistic | 9.779 (df = 1; 28) |
| Note: | P-value for slope: 0.004092 |
The linear regression of home price on the number of bathrooms showed that each additional bathroom was associated with an increase of approximately $194,700 in home price (a coefficient of 194.739, with Price in thousands of dollars). The relationship was statistically significant (p = 0.0041), and the model explained about 26% of the variance in home prices. This suggests that the number of bathrooms is an important factor in predicting home value.
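The significance reported for the bathroom slope can be cross-checked from the table alone. The sketch below, which uses only the rounded table values, computes the t-statistic as the coefficient divided by its standard error and confirms that, in a regression with a single predictor, the F statistic equals the square of that t-statistic.

```r
# Reported slope and standard error for Baths
slope <- 194.739
se <- 62.275

# t-statistic for the slope and its two-sided p-value (df = 28)
t_stat <- slope / se
p_value <- 2 * pt(-abs(t_stat), df = 28)

# With one predictor, the overall F statistic is t squared
f_from_t <- t_stat^2

cat("t:", round(t_stat, 3), " t^2:", round(f_from_t, 3),
    " p:", signif(p_value, 3), "\n")
```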
| | Dependent variable: Home Price |
|---|---|
| Size | 0.281 (0.119) |
| Number of Bedrooms | -33.704 (67.926) |
| Number of Bathrooms | 83.984 (76.753) |
| Constant | -41.561 (210.381) |
| Observations | 30 |
| R² | 0.391 |
| Adjusted R² | 0.321 |
| Residual Std. Error | 221.820 (df = 26) |
| F Statistic | 5.568 (df = 3; 26) |
| Note: | P-values: Size = 0.025857, Bedrooms = 0.623933, Bathrooms = 0.283894 |
A multiple regression was conducted to determine how size, number of bedrooms, and number of bathrooms jointly influence home price. The model explained 39.1% of the variance in home prices (Adjusted R² = 32.1%), and was statistically significant overall (F(3,26) = 5.568, p < 0.01).
Size had a statistically significant positive effect (p = 0.026), suggesting that larger homes tend to be more expensive.
Bedrooms did not show a significant effect (p = 0.624), indicating no clear price impact when controlling for size and bathrooms.
Bathrooms also did not reach statistical significance (p = 0.284), though the positive coefficient suggests a potential trend.
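The summary statistics of the multiple regression are tied together by standard formulas, so they can be checked against each other. The following sketch recomputes the adjusted R² and the overall F statistic from the reported R², the sample size, and the number of predictors; small differences from the table are expected because R² is rounded to three decimals.

```r
# Reported values from the multiple regression table
r2 <- 0.391  # R-squared
n <- 30      # observations
k <- 3       # predictors: Size, Beds, Baths

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Overall F statistic expressed in terms of R-squared
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

# p-value for the reported F statistic on (3, 26) degrees of freedom
p_value <- pf(5.568, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

cat("Adjusted R2:", round(adj_r2, 3), "\n")
cat("F from R2:", round(f_stat, 3), "\n")
cat("p-value for F = 5.568:", signif(p_value, 3), "\n")
```

The adjusted R² of 0.321 is slightly below the 0.337 of the size-only model, which is consistent with bedrooms and bathrooms adding little once size is in the model.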
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| State | 3 | 1198169 | 399389.58 | 7.354696 | 0.000148 |
| Residuals | 116 | 6299266 | 54304.02 | | |
There are statistically significant differences in average home prices among the four states (F(3,116) = 7.35, p < 0.001). As the box plot shows, the median home price in California is close to $600 thousand, while the other states have medians between $200 thousand and $300 thousand.
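The entries of the ANOVA table can be reproduced from its sums of squares and degrees of freedom. The sketch below recomputes the F statistic and its p-value from the table values and adds eta squared, the share of total price variation attributable to state. Eta squared is not reported in the original analysis and is included here only as an illustrative effect-size measure.

```r
# Values taken from the ANOVA table
ss_state <- 1198169
ss_resid <- 6299266
df_state <- 3
df_resid <- 116

# Mean squares and the F statistic
ms_state <- ss_state / df_state
ms_resid <- ss_resid / df_resid
f_stat <- ms_state / ms_resid

# Upper-tail p-value of the F distribution
p_value <- pf(f_stat, df_state, df_resid, lower.tail = FALSE)

# Eta squared: proportion of total variation explained by State
eta_sq <- ss_state / (ss_state + ss_resid)

cat("F:", round(f_stat, 3), " p:", signif(p_value, 3),
    " eta squared:", round(eta_sq, 3), "\n")
```

The F test alone does not say which states differ from which. A post hoc comparison such as `TukeyHSD(anova_model)` would be one way to check this, though it was not part of the original analysis.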
The results show a clear and statistically significant positive relationship between home size and price. The number of bathrooms is a significant predictor on its own (p = 0.004) but not once size is controlled for (p = 0.284), while the number of bedrooms is not significant in either model; both variables show a positive relationship with price when considered individually. The ANOVA shows that average prices differ significantly among the four states, and the box plot indicates that it is considerably more expensive to purchase a home in California than in New Jersey, New York, or Pennsylvania.
Code snippets used in the analysis of each question:
```r
# Assumes home_data (HomesForSale) and ca_data (HomesForSaleCA) are already
# loaded as data frames
library(stargazer)  # formats the regression tables

# -----------------------------------------------------------------------------
# Question 1: price vs. size (California)
model <- lm(Price ~ Size, data = ca_data)
pval <- summary(model)$coefficients[2, 4]  # p-value for the slope
stargazer(model, type = "html", title = "Regression Results",
          dep.var.labels = "Home Price", covariate.labels = "Home Size (sqft)",
          digits = 3, out = "model_output.html", star.cutoffs = NA,
          notes = paste("P-value for slope:", round(pval, 6)),
          notes.append = FALSE)
plot(ca_data$Size, ca_data$Price, main = "Price vs Size",
     xlab = "Size (sq ft)", ylab = "Price", pch = 19, col = "blue")
abline(model, col = "red", lwd = 2)
# -----------------------------------------------------------------------------
# Question 2: price vs. number of bedrooms (California)
bedroom_model <- lm(Price ~ Beds, data = ca_data)
pval <- summary(bedroom_model)$coefficients[2, 4]  # p-value for the slope
stargazer(bedroom_model, type = "html", title = "Regression: Price ~ Bedrooms",
          dep.var.labels = "Home Price", covariate.labels = "Number of Bedrooms",
          digits = 3, single.row = TRUE, star.cutoffs = NA,
          notes = paste("P-value for slope:", round(pval, 6)),
          notes.append = FALSE)
plot(ca_data$Beds, ca_data$Price,
     main = "Price vs Number of Bedrooms",
     xlab = "Number of Bedrooms", ylab = "Price",
     pch = 19, col = "darkblue")
abline(bedroom_model, col = "red", lwd = 2)
# -----------------------------------------------------------------------------
# Question 3: price vs. number of bathrooms (California)
bathroom_model <- lm(Price ~ Baths, data = ca_data)
pval <- summary(bathroom_model)$coefficients[2, 4]  # p-value for the slope
stargazer(bathroom_model, type = "html", title = "Regression: Price ~ Bathrooms",
          dep.var.labels = "Home Price", covariate.labels = "Number of Bathrooms",
          digits = 3, single.row = TRUE, star.cutoffs = NA,
          notes = paste("P-value for slope:", round(pval, 6)),
          notes.append = FALSE)
# Plot against Baths (not Beds) so the points match the fitted line
plot(ca_data$Baths, ca_data$Price,
     main = "Price vs Number of Bathrooms",
     xlab = "Number of Bathrooms", ylab = "Price",
     pch = 19, col = "darkblue")
abline(bathroom_model, col = "red", lwd = 2)
# -----------------------------------------------------------------------------
# Question 4: price vs. size, bedrooms, and bathrooms jointly (California)
multi_model <- lm(Price ~ Size + Beds + Baths, data = ca_data)
pvals <- summary(multi_model)$coefficients[, 4]  # p-values for all terms
# paste0() avoids the stray spaces that paste() inserts before the commas
p_note <- paste0("P-values: Size = ", round(pvals["Size"], 6),
                 ", Bedrooms = ", round(pvals["Beds"], 6),
                 ", Bathrooms = ", round(pvals["Baths"], 6))
stargazer(multi_model,
          type = "html",  # Use "text" for console or "latex" for PDF output
          title = "Multiple Regression: Price ~ Size + Bedrooms + Bathrooms",
          dep.var.labels = "Home Price",
          covariate.labels = c("Size", "Number of Bedrooms", "Number of Bathrooms"),
          digits = 3,
          single.row = TRUE,
          star.cutoffs = NA,  # disables asterisks
          notes = p_note,
          notes.append = FALSE)
# Diagnostic plot: residuals against fitted values
plot(multi_model, which = 1,
     main = "Residuals vs Fitted",
     col = "darkblue", pch = 19)
# -----------------------------------------------------------------------------
# Question 5: differences in price among states (all four states)
library(knitr)  # kable() for the ANOVA table
anova_model <- aov(Price ~ State, data = home_data)
kable(summary(anova_model)[[1]],
      caption = "ANOVA Table: Differences in Home Prices by State",
      digits = 6)
boxplot(Price ~ State, data = home_data,
        main = "Home Prices by State",
        xlab = "State", ylab = "Price",
        col = "lightblue", border = "gray40")
# -----------------------------------------------------------------------------
```