1. Introduction

This analysis examines how home characteristics (size, number of bedrooms, and number of bathrooms) individually and jointly relate to home prices in California, and whether home prices differ across four states (CA, NY, NJ, PA). By addressing these questions, this study seeks to give a clearer picture of the factors driving price variation in the real estate market.

Using R for statistical analysis and visualization, we aim to extract meaningful insights from the dataset. The findings from this study will help contextualize the influence of home attributes and location on pricing, offering valuable information for potential buyers, sellers, and policymakers.

The following research questions guide this report:

  1. To what extent does the size of a home influence its price in California?
  2. How does the number of bedrooms influence home prices in California?
  3. How does the number of bathrooms influence home prices in California?
  4. How do size, number of bedrooms, and number of bathrooms jointly impact home prices in California?
  5. Are there significant differences in home prices among the states of CA, NY, NJ, and PA?

2. Analysis

This section addresses each research question in turn using the “HomesForSale” dataset. For California homes, we examine how size, number of bedrooms, and number of bathrooms relate to price; we then compare prices across the four states.

Simple and multiple linear regression, one-way ANOVA, and plots are used to quantify these relationships. Price is recorded in thousands of dollars, so regression coefficients and residual standard errors are in units of $1,000.

home = read.csv("HomesForSale.csv")
head(home)
##   State Price Size Beds Baths
## 1    CA   533 1589    3   2.5
## 2    CA   610 2008    3   2.0
## 3    CA   899 2380    5   3.0
## 4    CA   929 1868    3   3.0
## 5    CA   210 1360    2   2.0
## 6    CA   268 2131    3   2.0

# Subset for California homes only
home_CA <- subset(home, State == "CA")

Question 1: To what extent does the size of a home influence its price in California?

# Fit a linear regression model
model_size <- lm(Price ~ Size, data = home_CA)

# Summary of the model
summary(model_size)
## 
## Call:
## lm(formula = Price ~ Size, data = home_CA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462.55 -139.69   39.24  147.65  352.21 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -56.81675  154.68102  -0.367 0.716145    
## Size          0.33919    0.08558   3.963 0.000463 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared:  0.3594, Adjusted R-squared:  0.3365 
## F-statistic: 15.71 on 1 and 28 DF,  p-value: 0.0004634

# Scatterplot with regression line
plot(home_CA$Size, home_CA$Price, 
     main = "Relationship Between Home Size and Price in California",
     xlab = "Home Size (sq. ft.)", 
     ylab = "Price ($1,000)", 
     pch = 16, 
     col = "blue")

# Add regression line
abline(model_size, col = "red", lwd = 2)

The scatterplot above illustrates the relationship between home size (in square feet) and price (in thousands of dollars) for homes in California, with a red line representing the fitted regression model.

Insights:

  • Slope (0.33919): Because Price is recorded in thousands of dollars, each additional square foot of home size is associated with an average price increase of about $339.
  • Residual standard error (219.3): Actual prices typically deviate from the predicted prices by about $219,300.
  • R-squared (0.3594): About 35.94% of the variation in home prices is explained by home size.
  • p-value (0.000463): The p-value is well below 0.05, indicating a statistically significant linear relationship between home size and price.
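
Because Price is recorded in thousands of dollars, it can help to convert the fitted line into dollar terms. The sketch below is a minimal illustration that uses only the coefficients printed in the summary(model_size) output above; the 2,000 sq. ft. home is a hypothetical example, not an observation from the dataset.

```r
# Coefficients copied from the summary(model_size) output above.
# Price is in $1,000s, Size is in square feet.
intercept <- -56.81675
slope     <- 0.33919

# Dollar change in predicted price per additional square foot
dollars_per_sqft <- slope * 1000

# Predicted price for a hypothetical 2,000 sq. ft. home (assumed value)
size_sqft      <- 2000
pred_thousands <- intercept + slope * size_sqft
pred_dollars   <- pred_thousands * 1000

cat("Dollars per additional sq. ft.:", round(dollars_per_sqft, 2), "\n")
cat("Predicted price for", size_sqft, "sq. ft.:", round(pred_dollars), "dollars\n")
```

With the fitted object available, predict(model_size, newdata = data.frame(Size = 2000)) should give the same prediction up to rounding of the printed coefficients.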

Question 2: How does the number of bedrooms influence home prices in California?

# Fit the regression model for Price vs. Number of Bedrooms
model_bedrooms <- lm(Price ~ Beds, data = home_CA)

# Summary of the model
summary(model_bedrooms)
## 
## Call:
## lm(formula = Price ~ Beds, data = home_CA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -413.83 -236.62   29.94  197.69  570.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   269.76     233.62   1.155    0.258
## Beds           84.77      72.91   1.163    0.255
## 
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared:  0.04605,    Adjusted R-squared:  0.01198 
## F-statistic: 1.352 on 1 and 28 DF,  p-value: 0.2548

# Plotting the relationship between Bedrooms and Price
plot(home_CA$Beds, home_CA$Price, main="Home Price vs. Number of Bedrooms",
     xlab="Number of Bedrooms", ylab="Price ($1,000)", pch=19, col="blue")
abline(model_bedrooms, col="red")

The graph above illustrates the relationship between the number of bedrooms and home prices, showing the observed data points along with the fitted regression line.

Insights:

  • The “Estimate” (84.77) for the slope coefficient indicates that each additional bedroom is associated with an average price increase of approximately $84,770, although this estimate is imprecise (standard error 72.91, or about $72,910).
  • The “Residual standard error” (267.6) indicates that actual prices typically deviate from the predicted prices by about $267,600.
  • The R-squared value (0.04605) indicates that only 4.61% of the variation in home prices is explained by the number of bedrooms.
  • The p-value (0.255) associated with the slope coefficient indicates that the relationship between the number of bedrooms and home price is not statistically significant at the 0.05 significance level.
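
A 95% confidence interval for the bedroom slope makes the lack of significance concrete. The following is a minimal sketch that uses only the estimate, standard error, and residual degrees of freedom printed in the summary(model_bedrooms) output above; the variable names are illustrative.

```r
# Values copied from the summary(model_bedrooms) output above
beds_slope <- 84.77   # estimate, in $1,000s per bedroom
beds_se    <- 72.91   # standard error of the estimate
resid_df   <- 28      # residual degrees of freedom

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df = resid_df)

# Interval = estimate +/- critical value * standard error
ci <- beds_slope + c(-1, 1) * t_crit * beds_se
names(ci) <- c("lower", "upper")

print(round(ci, 1))   # interval in $1,000s per bedroom
```

The interval spans zero, which is consistent with the p-value of 0.255. With the fitted object available, confint(model_bedrooms) should return the same interval up to rounding.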

Question 3: How does the number of bathrooms influence home prices in California?

# Subset the data for California homes
home_CA_baths <- home[home$State == "CA", ]

# Fit the regression model
model_baths <- lm(Price ~ Baths, data = home_CA_baths)

# View the summary of the model
summary(model_baths)
## 
## Call:
## lm(formula = Price ~ Baths, data = home_CA_baths)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -374.93 -181.56   -2.74  152.31  614.81 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    90.71     148.57   0.611  0.54641   
## Baths         194.74      62.28   3.127  0.00409 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared:  0.2588, Adjusted R-squared:  0.2324 
## F-statistic: 9.779 on 1 and 28 DF,  p-value: 0.004092

# Plot the relationship between number of bathrooms and price (blue points, red regression line)
plot(home_CA_baths$Baths, home_CA_baths$Price, 
     xlab = "Number of Bathrooms", ylab = "Price (in $1,000)", 
     main = "Number of Bathrooms vs. Home Price", col = "blue", pch = 16)
abline(model_baths, col = "red")

The scatter plot shows the relationship between the number of bathrooms and home price for homes in California, with a red regression line indicating the trend.

Insights:

  • The Estimate (194.74) for the slope coefficient indicates that each additional bathroom is associated with an average price increase of approximately $194,740.
  • The Residual standard error (235.8) indicates that actual prices typically deviate from the predicted prices by about $235,800, which is smaller than for the bedrooms model (267.6) but larger than for the size model (219.3).
  • The R-squared value (0.2588) indicates that 25.88% of the variation in home prices is explained by the number of bathrooms. This is a stronger relationship than for bedrooms (4.61%) but weaker than for home size (35.94%).
  • The p-value (0.00409) for the slope coefficient indicates that the relationship between the number of bathrooms and home price is statistically significant at the 0.01 significance level, so the number of bathrooms on its own is a meaningful predictor of home prices.
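
As a consistency check on the output above: in a simple regression, the F-statistic can be recovered from R-squared and the residual degrees of freedom, and it equals the square of the slope's t value. Using the rounded values printed by summary(model_baths), with n = 30 inferred from the 28 residual degrees of freedom:

```latex
F \;=\; \frac{R^2}{1 - R^2}\,(n - 2)
  \;=\; \frac{0.2588}{0.7412} \times 28
  \;\approx\; 9.78,
\qquad
t^2 \;=\; 3.127^2 \;\approx\; 9.78
```

Both agree with the reported F-statistic of 9.779 up to rounding.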

Question 4: How do size, number of bedrooms, and number of bathrooms jointly impact home prices in California?

# Subset the data for California homes
home_CA_multiple <- home[home$State == "CA", ]

# Fit the multiple regression model
model_multiple <- lm(Price ~ Size + Beds + Baths, data = home_CA_multiple)

# View the summary of the model
summary(model_multiple)
## 
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = home_CA_multiple)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -415.47 -130.32   19.64  154.79  384.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -41.5608   210.3809  -0.198   0.8449  
## Size          0.2811     0.1189   2.364   0.0259 *
## Beds        -33.7036    67.9255  -0.496   0.6239  
## Baths        83.9844    76.7530   1.094   0.2839  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared:  0.3912, Adjusted R-squared:  0.3209 
## F-statistic: 5.568 on 3 and 26 DF,  p-value: 0.004353

Insights:

  • The coefficient 0.2811 for Size means that, holding Beds and Baths constant, each additional square foot is associated with an average price increase of about $281 (roughly $281,100 per 1,000 sq. ft.). The p-value (0.0259) indicates that this relationship is statistically significant at the 0.05 level.
  • The coefficient -33.7036 for Beds means that, holding Size and Baths constant, each additional bedroom is associated with a price about $33,700 lower. However, the p-value (0.6239) is large, indicating that this relationship is not statistically significant.
  • The coefficient 83.9844 for Baths means that, holding Size and Beds constant, each additional bathroom is associated with a price about $84,000 higher. However, the p-value (0.2839) is large, indicating that this relationship is not statistically significant.
  • The p-value for the Intercept (0.8449) is high, so the intercept is not significantly different from zero. It corresponds to a home with zero square feet, bedrooms, and bathrooms, which lies outside the range of the data, so it has no practical interpretation.
  • The model as a whole is significant (F = 5.568, p-value = 0.004353) and explains 39.12% of the variation in price, only slightly more than size alone (35.94%). The adjusted R-squared (0.3209) is lower than that of the size-only model (0.3365).
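
Whether Beds and Baths add explanatory power beyond Size can be checked with a partial F-test comparing the size-only model to the three-predictor model. The sketch below is an illustration under the assumption that both models were fit to the same 30 California homes; it uses only the R-squared values and degrees of freedom printed in the two summaries, and the variable names are illustrative.

```r
# R-squared values copied from the two model summaries in this report
r2_reduced <- 0.3594   # Price ~ Size
r2_full    <- 0.3912   # Price ~ Size + Beds + Baths

q        <- 2    # number of added predictors (Beds and Baths)
df_resid <- 26   # residual degrees of freedom of the full model

# Partial F-statistic for the nested-model comparison
partial_f <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_resid)

# Upper-tail probability from the F distribution
p_value <- pf(partial_f, df1 = q, df2 = df_resid, lower.tail = FALSE)

cat("Partial F:", round(partial_f, 3), " p-value:", round(p_value, 3), "\n")
```

The resulting p-value is well above 0.05, so there is no evidence that adding Beds and Baths improves on the size-only model. With the fitted objects available, anova(model_size, model_multiple) should carry out the same comparison.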

Question 5: Are there significant differences in home prices among the states of CA, NY, NJ, and PA?

# Subset the data for the four states
home_states <- home[home$State %in% c("CA", "NY", "NJ", "PA"), ]

# Perform ANOVA to compare home prices across states
anova_model <- aov(Price ~ State, data = home_states)

# View the summary of the ANOVA model
summary(anova_model)
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## State         3 1198169  399390   7.355 0.000148 ***
## Residuals   116 6299266   54304                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Boxplot to visualize the differences in home prices across states
boxplot(Price ~ State, data = home_states, 
        main = "Home Prices by State", 
        xlab = "State", ylab = "Price (in $1,000)", 
        col = c("lightblue", "lightgreen", "lightpink", "lightyellow"))

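The boxplot illustrates the distribution of home prices across the states of California, New York, New Jersey, and Pennsylvania, showing the median, interquartile range, and potential outliers for each state.

The ANOVA gives F = 7.355 on 3 and 116 degrees of freedom (p-value = 0.000148), so the hypothesis that mean home prices are equal across the four states is rejected at the 0.05 significance level. The test shows that at least one state's mean price differs from the others; it does not identify which states differ.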
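
The ANOVA table can also be used to gauge how much of the price variation is associated with state. The sketch below recomputes the F-statistic from the printed sums of squares and derives eta-squared, an effect-size measure that is not part of the original analysis; all inputs are copied from the summary(anova_model) output above.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_state <- 1198169; df_state <- 3
ss_resid <- 6299266; df_resid <- 116

# F-statistic = mean square between states / mean square within states
f_stat  <- (ss_state / df_state) / (ss_resid / df_resid)
p_value <- pf(f_stat, df_state, df_resid, lower.tail = FALSE)

# Eta-squared: share of total variation in Price associated with State
eta_sq <- ss_state / (ss_state + ss_resid)

cat("F:", round(f_stat, 3), " p-value:", signif(p_value, 3),
    " eta-squared:", round(eta_sq, 3), "\n")
```

State accounts for roughly 16% of the variation in price. A post hoc procedure such as TukeyHSD() applied to the aov object (with State stored as a factor) could then indicate which pairs of states differ; that step is not part of the original analysis.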

3. Summary

This analysis explored factors influencing home prices in California and compared home prices across four states: California (CA), New York (NY), New Jersey (NJ), and Pennsylvania (PA). Using linear regression and ANOVA, we examined how key home attributes (size, number of bedrooms, and number of bathrooms) relate to home prices. The findings are summarized as follows:

  1. Home Size and Price in California: A significant positive relationship was found between home size and price in California. Each additional square foot of home size is associated with a price increase of approximately $339. The relationship was statistically significant (p-value = 0.0005), and the R-squared value indicates that home size explains about 36% of the variation in home prices.

  2. Number of Bedrooms and Price in California: The relationship between the number of bedrooms and price was weak and not statistically significant. The estimated slope was approximately $84,770 per additional bedroom, but the low R-squared value (4.61%) and the high p-value (0.255) mean the data provide no evidence that the number of bedrooms explains the variation in home prices.

  3. Number of Bathrooms and Price in California: A stronger relationship was found between the number of bathrooms and home price. In the simple regression, each additional bathroom is associated with a price increase of approximately $194,740. This relationship was statistically significant (p-value = 0.004), with an R-squared value of 25.88%, indicating that the number of bathrooms on its own is a meaningful predictor of home prices.

  4. Joint Impact of Size, Bedrooms, and Bathrooms on Price: When size, number of bedrooms, and number of bathrooms were modeled together, home size had a statistically significant positive effect on price, with a coefficient of about $281 per square foot (roughly $281,100 per 1,000 square feet). The number of bedrooms and the number of bathrooms were not statistically significant in this multiple regression model, indicating that they add little explanatory power once home size is accounted for.

  5. Price Differences Across States: ANOVA testing revealed significant differences in mean home prices across the four states of CA, NY, NJ, and PA (F = 7.355, p-value = 0.000148). The boxplot is consistent with this result, showing visible variation in home prices among the states.

Conclusion: Home size is the most consistent predictor of home prices in California in this dataset. The number of bathrooms is a significant predictor on its own but not after accounting for size and bedrooms, and the number of bedrooms shows no significant relationship with price in either model. Mean home prices also differ significantly across CA, NY, NJ, and PA, suggesting that location plays an important role, although the ANOVA alone does not show which states differ. These insights can assist home buyers, sellers, and real estate professionals in making informed decisions based on home attributes and location.

References

Lock5Stat. (n.d.). HomesForSale dataset. Retrieved from https://www.lock5stat.com/datapage3e.html

R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Retrieved from https://www.r-project.org