Introduction

The HomesForSale dataset consists of 120 observations on 5 variables describing residential real estate listings: the state in which the home is located (CA, NJ, NY, or PA), the asking price (in $1,000s), the size of the home (recorded in square feet in the file used here, for example 1589), and the numbers of bedrooms and bathrooms.

This dataset provides a useful foundation for exploring how features of a home, such as its size, number of bedrooms, and number of bathrooms, relate to its asking price, and whether geographic location plays a role in pricing differences. Because housing markets vary widely by region and by home features, these questions are relevant to both buyers and sellers seeking to understand pricing trends.

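The following research questions are addressed in this report:

Q1: Using the data only for California, how much does the size of a home influence its price?
Q2: Using the data only for California, how does the number of bedrooms of a home influence its price?
Q3: Using the data only for California, how does the number of bathrooms of a home influence its price?
Q4: Using the data only for California, how do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?
Q5: Are there significant differences in home prices among the four states (CA, NY, NJ, PA)?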

By analyzing these questions through regression and ANOVA techniques, this report aims to uncover how individual home features and location affect asking prices in the housing market.

Data

Each of the 120 rows in the HomesForSale dataset is one home listing, described by 5 variables covering location, price, size, and room counts. Key variables include:

State: state in which the home is located (CA, NJ, NY, or PA), stored as a character string
Price: asking price, in $1,000s
Size: size of the home, in square feet
Beds: number of bedrooms
Baths: number of bathrooms (may take half values such as 2.5)

Analysis

We first load the data and inspect its structure, then address each question in turn.

home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
head(home)
##   State Price Size Beds Baths
## 1    CA   533 1589    3   2.5
## 2    CA   610 2008    3   2.0
## 3    CA   899 2380    5   3.0
## 4    CA   929 1868    3   3.0
## 5    CA   210 1360    2   2.0
## 6    CA   268 2131    3   2.0
str(home)
## 'data.frame':    120 obs. of  5 variables:
##  $ State: chr  "CA" "CA" "CA" "CA" ...
##  $ Price: int  533 610 899 929 210 268 1095 699 729 700 ...
##  $ Size : int  1589 2008 2380 1868 1360 2131 2436 1375 2013 1371 ...
##  $ Beds : int  3 3 5 3 2 3 3 2 3 3 ...
##  $ Baths: num  2.5 2 3 3 2 2 2 1 4 2 ...

Q1: Use the data only for California. How much does the size of a home influence its price?

To explore the relationship between the size of a home and its asking price in California, we fit a simple linear regression model.

## 
## Call:
## lm(formula = Price ~ Size, data = CA_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462.55 -139.69   39.24  147.65  352.21 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -56.81675  154.68102  -0.367 0.716145    
## Size          0.33919    0.08558   3.963 0.000463 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared:  0.3594, Adjusted R-squared:  0.3365 
## F-statistic: 15.71 on 1 and 28 DF,  p-value: 0.0004634

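The slope estimate for Size is 0.339. Because Price is recorded in $1,000s and Size in square feet, this corresponds to an average increase in asking price of about $339 per additional square foot, or roughly $339,000 per additional 1,000 sq. ft., among the California listings.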

The p-value for Size is 0.000463, which is highly significant (p < 0.001). This means there is strong evidence that home size is positively associated with price.

The R-squared is 0.359, meaning about 36% of the variation in home prices can be explained by size alone.
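The unit conversion and the figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it works from the rounded values printed in the summary output rather than from the fitted model object, so small rounding differences are expected, and the variable names are our own.

```r
# Values copied (rounded) from the summary output for lm(Price ~ Size)
slope <- 0.33919   # change in Price ($1,000s) per additional square foot
se    <- 0.08558   # standard error of the slope
df    <- 28        # residual degrees of freedom

# Re-express the slope on more readable scales
dollars_per_sqft     <- slope * 1000          # dollars per square foot
dollars_per_1000sqft <- slope * 1000 * 1000   # dollars per 1,000 sq. ft.

# In a one-predictor regression, t^2 equals the F-statistic and
# R^2 = F / (F + residual df), so the printed values should agree
t_stat <- slope / se
f_stat <- t_stat^2
r_sq   <- f_stat / (f_stat + df)

print(c(per_sqft = dollars_per_sqft, per_1000sqft = dollars_per_1000sqft))
print(round(c(t = t_stat, F = f_stat, R2 = r_sq), 4))
```

The recomputed t, F, and R-squared should match the printed 3.963, 15.71, and 0.3594 up to rounding.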

Q2: Use the data only for California. How does the number of bedrooms of a home influence its price?

Next, we investigate how the number of bedrooms influences the price of a home in California.

The slope is 84.77, meaning each additional bedroom is associated with an asking price about $84,770 higher, on average.

However, the p-value for Beds is 0.255, which is not statistically significant (p > 0.05), so we do not have enough evidence to conclude that the number of bedrooms has a real effect on price.

Only ~4.6% of the price variation is explained by bedroom count (R² = 0.046), which is low.

Q3: Use the data only for California. How does the number of bathrooms of a home influence its price?

The slope is 194.74, suggesting each additional bathroom is associated with an average price increase of $194,740.

The p-value is 0.004, which is statistically significant (p < 0.01), indicating strong evidence of a positive relationship between bathrooms and price.

The R² = 0.259, meaning around 26% of the price variation is explained by bathroom count.
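One way to compare the single-predictor fits on a common footing is to turn each reported slope and standard error into a 95% confidence interval. The sketch below is an assumption-laden illustration: `slope_ci` is a hypothetical helper of our own, and the inputs are the rounded estimates reported by `summary()` for the Size, Beds, and Baths models, each with 28 residual degrees of freedom.

```r
# Hypothetical helper: confidence interval for a regression slope from a
# reported estimate, standard error, and residual degrees of freedom
slope_ci <- function(estimate, se, df, level = 0.95) {
  t_crit <- qt(1 - (1 - level) / 2, df)
  c(lower = estimate - t_crit * se, upper = estimate + t_crit * se)
}

# Rounded estimates and standard errors from the three California models
ci_size  <- slope_ci(0.33919, 0.08558, 28)  # $1,000s per square foot
ci_beds  <- slope_ci(84.77,   72.91,   28)  # $1,000s per bedroom
ci_baths <- slope_ci(194.74,  62.28,   28)  # $1,000s per bathroom

print(round(rbind(Size = ci_size, Beds = ci_beds, Baths = ci_baths), 3))
```

An interval that contains zero, as the Beds interval does, tells the same story as a p-value above 0.05. The Size and Baths intervals lie entirely above zero.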

Q4: Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?

Only Size has a statistically significant slope (p = 0.0259), confirming its importance in predicting price.

Beds (p = 0.624) and Baths (p = 0.284) are not significant in the presence of Size, possibly due to multicollinearity or overlapping explanatory power.

The model explains about 39% of the variation in price (R² = 0.391), and the overall model is significant (p = 0.00435), meaning the combined variables are useful for prediction.
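Whether Beds and Baths add anything once Size is in the model can also be examined with a partial F-test comparing the Size-only model to the three-predictor model. The sketch below approximates that test from the rounded R-squared values reported for the two models. It is not the output of a fitted-model comparison, and the variable names are our own.

```r
# Rounded R-squared values reported for the two California models
r2_reduced <- 0.3594   # Price ~ Size
r2_full    <- 0.3912   # Price ~ Size + Beds + Baths
df_resid   <- 26       # residual df of the full model
q          <- 2        # extra predictors in the full model (Beds, Baths)
k          <- 3        # predictors in the full model

# Overall F for the full model, recovered from R^2 as a consistency check
f_overall <- (r2_full / k) / ((1 - r2_full) / df_resid)

# Partial F: gain in R^2 per added predictor, relative to the
# unexplained variation per residual degree of freedom
f_partial <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_resid)
p_partial <- pf(f_partial, q, df_resid, lower.tail = FALSE)

print(round(c(F_overall = f_overall, F_partial = f_partial, p = p_partial), 3))
```

The recovered overall F should agree with the reported 5.568 up to rounding. With the fitted objects available, `anova(model1, model4)` would give the exact version of the partial test.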

Q5: Are there significant differences in home prices among the four states (CA, NY, NJ, PA)?

##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## State         3 1198169  399390   7.355 0.000148 ***
## Residuals   116 6299266   54304                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value for the State factor is 0.000148, which is highly significant (p < 0.001).

This means there is very strong evidence that the average home price differs by state.

The ANOVA F-test shows that the four state means are not all equal. It does not, by itself, identify which states differ from which.
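The F-statistic and p-value in the table can be reproduced from the sums of squares, and the same numbers give a simple effect-size measure (eta-squared, the share of total variation in Price associated with State). The sketch below uses only the values printed in the ANOVA table. The variable names are our own, and eta-squared is an addition for illustration rather than part of the original output.

```r
# Values copied from the ANOVA table for aov(Price ~ State)
ss_state <- 1198169   # sum of squares between states
ss_resid <- 6299266   # residual (within-state) sum of squares
df_state <- 3
df_resid <- 116

# F is the ratio of the two mean squares
ms_state <- ss_state / df_state
ms_resid <- ss_resid / df_resid
f_stat   <- ms_state / ms_resid
p_value  <- pf(f_stat, df_state, df_resid, lower.tail = FALSE)

# Eta-squared: proportion of total variation attributable to State
eta_sq <- ss_state / (ss_state + ss_resid)

print(round(c(F = f_stat, eta_sq = eta_sq), 4))
print(signif(p_value, 3))
```

State accounts for roughly 16% of the variation in asking price. To see which pairs of states differ, a post-hoc comparison such as `TukeyHSD()` on the fitted `aov` object (with State stored as a factor) would be a natural follow-up.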

Summary

Within California, home size shows a statistically significant positive relationship with asking price, both on its own (p < 0.001) and after adjusting for bedrooms and bathrooms (p = 0.0259). Bathroom count is significant when analyzed alone (p = 0.004), but its slope becomes non-significant once size and bedrooms are included. Bedroom count was not a significant predictor of price in any model. Across the four states, the ANOVA reveals significant differences in average asking price (p < 0.001). Taken together, home size and location (state) are the factors most clearly associated with asking price in these data, whereas bedroom and bathroom counts explain little additional variation once size is considered. Because the California models rest on only 30 listings, these results are best read as associations rather than causal effects.

References

# Lock5Stat Dataset reference
# https://www.lock5stat.com/datapage3e.html

# Q1: Use the data only for California. How much does the size of a home influence its price?
# Subset data for California
# CA_data <- subset(home, State == "CA")
# Fit the linear regression model
# model1 <- lm(Price ~ Size, data = CA_data)
# Display the summary of the model
# summary(model1)
# Plot Size vs Price with the regression line
# plot(CA_data$Size, CA_data$Price, main = "Size vs Price in California", xlab = "Size (sq. ft.)", ylab = "Price ($1,000)")
# abline(model1, col="blue")

# Q2: Use the data only for California. How does the number of bedrooms of a home influence its price?
# Fit the linear regression model for Bedrooms vs Price
# model2 <- lm(Price ~ Beds, data = CA_data)
# Display the summary of the model
# summary(model2)
# Call:
# lm(formula = Price ~ Beds, data = CA_data)
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -413.83 -236.62   29.94  197.69  570.94
# Coefficients:
#             Estimate Std. Error t value
# (Intercept)   269.76     233.62   1.155
# Beds           84.77      72.91   1.163
#             Pr(>|t|)
# (Intercept)    0.258
# Beds           0.255
# Residual standard error: 267.6 on 28 degrees of freedom
# Multiple R-squared:  0.04605, Adjusted R-squared:  0.01198 
# F-statistic: 1.352 on 1 and 28 DF,  p-value: 0.2548

# Q3: Use the data only for California. How does the number of bathrooms of a home influence its price?
# Fit the linear regression model for Bathrooms vs Price
# model3 <- lm(Price ~ Baths, data = CA_data)
# Display the summary of the model
# summary(model3)
# Call:
# lm(formula = Price ~ Baths, data = CA_data)
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -374.93 -181.56   -2.74  152.31  614.81 
# Coefficients:
#             Estimate Std. Error t value
# (Intercept)    90.71     148.57   0.611
# Baths         194.74      62.28   3.127
#             Pr(>|t|)   
# (Intercept)  0.54641   
# Baths        0.00409 **
# Signif. codes:  
# 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 235.8 on 28 degrees of freedom
# Multiple R-squared:  0.2588,  Adjusted R-squared:  0.2324 
# F-statistic: 9.779 on 1 and 28 DF,  p-value: 0.004092

# Q4: Use the data only for California. How do the size,  the number of bedrooms, and  the number of bathrooms of a home jointly influence its price? 
# Fit the multiple regression model
# model4 <- lm(Price ~ Size + Beds + Baths, data = CA_data)
# Display the summary of the model
# summary(model4)
# Call:
# lm(formula = Price ~ Size + Beds + Baths, data = CA_data)
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -415.47 -130.32   19.64  154.79  384.94 
# Coefficients:
#             Estimate Std. Error t value
# (Intercept) -41.5608   210.3809  -0.198
# Size          0.2811     0.1189   2.364
# Beds        -33.7036    67.9255  -0.496
# Baths        83.9844    76.7530   1.094
#             Pr(>|t|)  
# (Intercept)   0.8449  
# Size          0.0259 *
# Beds          0.6239  
# Baths         0.2839  
# Signif. codes:  
# 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 221.8 on 26 degrees of freedom
# Multiple R-squared:  0.3912,  Adjusted R-squared:  0.3209 
# F-statistic: 5.568 on 3 and 26 DF,  p-value: 0.004353

# Q5: Are there significant differences in home prices among the four states (CA, NY, NJ, PA)?
# Fit the ANOVA model for Price by State
# anova_model <- aov(Price ~ State, data = home)
# Display the summary of the ANOVA model
# summary(anova_model)
# Boxplot to visualize price differences by state
# boxplot(Price ~ State, data = home, main = "Home Prices by State", ylab = "Price ($1,000)", col = "lightblue")