1. Introduction

This report explores the relationships between home prices and various attributes, including size, number of bedrooms, and number of bathrooms, using data from the “HomesForSale” dataset provided by Lock5Stat. Specifically, the analysis focuses on homes in California to determine how individual and combined factors influence price.

Additionally, the report evaluates whether the state in which a home is located significantly impacts its price by comparing homes across California (CA), New York (NY), New Jersey (NJ), and Pennsylvania (PA). The study uses regression models and ANOVA techniques to assess these relationships, offering insights into how key variables and state differences contribute to home pricing trends.

The following questions are explored in this report:

Q1. Use the data only for California. How much does the size of a home influence its price?

Q2. Use the data only for California. How does the number of bedrooms of a home influence its price?

Q3. Use the data only for California. How does the number of bathrooms of a home influence its price?

Q4. Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?

Q5. Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This will help you determine if the state in which a home is located has a significant impact on its price. All data should be used.

2. Data

The data for this analysis was sourced from the Lock5Stat website: HomesForSale.csv. The dataset includes the following variables:

- Price: The price of the home (in thousands of dollars).
- Size: The size of the home (in square feet).
- Beds: The number of bedrooms in the home.
- Baths: The number of bathrooms in the home.
- State: The state in which the home is located (CA, NY, NJ, PA).

For the analysis:

- Q1–Q4: Data was filtered to include only homes in California.
- Q5: All data across the four states was used to compare prices.

Statistical models, including simple linear regression, multiple regression, and ANOVA, were employed to analyze the data and interpret the relationships between variables.

# Load the dataset
home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")

# Inspect the dataset to understand its structure
str(home)
## 'data.frame':    120 obs. of  5 variables:
##  $ State: chr  "CA" "CA" "CA" "CA" ...
##  $ Price: int  533 610 899 929 210 268 1095 699 729 700 ...
##  $ Size : int  1589 2008 2380 1868 1360 2131 2436 1375 2013 1371 ...
##  $ Beds : int  3 3 5 3 2 3 3 2 3 3 ...
##  $ Baths: num  2.5 2 3 3 2 2 2 1 4 2 ...

3. Analysis

This section examines each of the five questions in detail.

# Reload the data so this section runs on its own, then preview the first six rows
home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
head(home)
##   State Price Size Beds Baths
## 1    CA   533 1589    3   2.5
## 2    CA   610 2008    3   2.0
## 3    CA   899 2380    5   3.0
## 4    CA   929 1868    3   3.0
## 5    CA   210 1360    2   2.0
## 6    CA   268 2131    3   2.0

Q1. Use the data only for California. How much does the size of a home influence its price?

california_homes <- subset(home, State == "CA")
 
model_size <- lm(Price ~ Size, data = california_homes)
summary(model_size)
## 
## Call:
## lm(formula = Price ~ Size, data = california_homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462.55 -139.69   39.24  147.65  352.21 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -56.81675  154.68102  -0.367 0.716145    
## Size          0.33919    0.08558   3.963 0.000463 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared:  0.3594, Adjusted R-squared:  0.3365 
## F-statistic: 15.71 on 1 and 28 DF,  p-value: 0.0004634

Model Equation: Price = −56.82 + 0.3392 ⋅ Size

Interpretation:

- Because Price is recorded in thousands of dollars, the slope of 0.3392 means that each additional square foot is associated with a price about $339 higher.
- The slope is statistically significant (p = 0.000463), so the data give strong evidence of a positive linear association between size and price.
- The model explains 35.94% of the variability in prices (R^2 = 0.3594), indicating that while size is important, other factors also influence price.

Conclusion: Home size significantly impacts price, with larger homes being more expensive.
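As a worked example of the fitted equation, the sketch below plugs one size into the Q1 model by hand. The coefficients are copied from the summary output above; the 2,000 sq. ft. home and all variable names are illustrative assumptions, not values taken from the dataset.

```r
# Coefficients copied from summary(model_size) above
intercept <- -56.81675
slope     <- 0.33919   # thousands of dollars per square foot

# Hypothetical home size chosen only for illustration
size_sqft <- 2000

# Predicted price in thousands of dollars
pred_thousands <- intercept + slope * size_sqft

# Price is recorded in thousands of dollars, so convert to dollars
slope_dollars <- slope * 1000
pred_dollars  <- pred_thousands * 1000

cat("Dollars per extra sq. ft.:", round(slope_dollars), "\n")
cat("Predicted price for", size_sqft, "sq. ft.:", round(pred_dollars), "dollars\n")
```

With the fitted model object available, `predict(model_size, newdata = data.frame(Size = 2000))` should return the same point estimate up to rounding.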

Q2. Use the data only for California. How does the number of bedrooms of a home influence its price?

california_homes <- subset(home, State == "CA")

model_beds <- lm(Price ~ Beds, data = california_homes)
summary(model_beds)
## 
## Call:
## lm(formula = Price ~ Beds, data = california_homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -413.83 -236.62   29.94  197.69  570.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   269.76     233.62   1.155    0.258
## Beds           84.77      72.91   1.163    0.255
## 
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared:  0.04605,    Adjusted R-squared:  0.01198 
## F-statistic: 1.352 on 1 and 28 DF,  p-value: 0.2548

Model Equation: Price = 269.76 + 84.77 ⋅ Beds

Interpretation:

- Because Price is recorded in thousands of dollars, the slope (84.77) suggests that each additional bedroom is associated with a price about $84,770 higher, but the result is not statistically significant (p = 0.255).
- The R^2 = 0.046 shows that only 4.6% of the variation in price is explained by the number of bedrooms.

Conclusion: In this sample of 30 California homes, the number of bedrooms on its own is not a statistically significant predictor of price.
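One way to see why the bedroom effect is inconclusive is to look at a 95% confidence interval for the slope. The sketch below rebuilds that interval from the estimate, standard error, and residual degrees of freedom printed in the summary above; the variable names are illustrative.

```r
# Estimate and standard error copied from summary(model_beds) above
estimate  <- 84.77   # thousands of dollars per bedroom
std_error <- 72.91
df_resid  <- 28

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)

print(round(ci, 2))
```

The interval runs from roughly -65 to 234 (thousands of dollars) and therefore includes zero, which is consistent with the non-significant p-value. Calling `confint(model_beds)` on the fitted model should give the same interval up to rounding.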

Q3. Use the data only for California. How does the number of bathrooms of a home influence its price?

california_homes <- subset(home, State == "CA")

model_baths <- lm(Price ~ Baths, data = california_homes)
summary(model_baths)
## 
## Call:
## lm(formula = Price ~ Baths, data = california_homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -374.93 -181.56   -2.74  152.31  614.81 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    90.71     148.57   0.611  0.54641   
## Baths         194.74      62.28   3.127  0.00409 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared:  0.2588, Adjusted R-squared:  0.2324 
## F-statistic: 9.779 on 1 and 28 DF,  p-value: 0.004092

Model Equation: Price = 90.71 + 194.74 ⋅ Baths

Interpretation:

- Because Price is recorded in thousands of dollars, each additional bathroom is associated with a price about $194,740 higher, and this relationship is statistically significant (p = 0.00409).
- The R^2 = 0.2588 indicates that 25.88% of the price variability is explained by the number of bathrooms.

Conclusion: Bathrooms significantly impact home prices, with more bathrooms associated with higher prices.
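The t value and p-value in the coefficient table follow directly from the estimate and its standard error. The sketch below reproduces them for the Baths slope using the rounded figures printed above, so small differences in the last digit are possible; the variable names are illustrative.

```r
# Estimate and standard error copied from summary(model_baths) above
estimate  <- 194.74
std_error <- 62.28
df_resid  <- 28

# t statistic for H0: slope = 0
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with 28 degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = df_resid)

cat("t value:", round(t_value, 3), "\n")
cat("p-value:", signif(p_value, 3), "\n")
```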

Q4. Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?

california_homes <- subset(home, State == "CA")

model_joint <- lm(Price ~ Size + Beds + Baths, data = california_homes)
summary(model_joint)
## 
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = california_homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -415.47 -130.32   19.64  154.79  384.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -41.5608   210.3809  -0.198   0.8449  
## Size          0.2811     0.1189   2.364   0.0259 *
## Beds        -33.7036    67.9255  -0.496   0.6239  
## Baths        83.9844    76.7530   1.094   0.2839  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared:  0.3912, Adjusted R-squared:  0.3209 
## F-statistic: 5.568 on 3 and 26 DF,  p-value: 0.004353

Model Equation: Price = −41.56 + 0.2811 ⋅ Size − 33.70 ⋅ Beds + 83.98 ⋅ Baths

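Interpretation:

- Size is the only significant predictor (p = 0.0259), while bedrooms (p = 0.6239) and bathrooms (p = 0.2839) are not.
- Bathrooms were significant in the single-predictor model (Q3) but not here, which suggests that much of the bathroom effect overlaps with home size.
- The model explains 39.12% of the variability in price (R^2 = 0.3912), slightly more than any single-predictor model. However, R^2 never decreases when predictors are added, and the adjusted R^2 (0.3209) is lower than that of the size-only model (0.3365), so adding bedrooms and bathrooms does not improve the fit once model complexity is taken into account.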

Conclusion: When considering size, bedrooms, and bathrooms together, size remains the strongest and only statistically significant predictor of price.
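To compare the joint model with the size-only model fairly, the adjusted R^2 penalises each extra predictor. The sketch below recomputes it from the reported R^2 values using the standard formula; the helper `adj_r2` is a hypothetical name, and n = 30 is inferred from the 28 residual degrees of freedom in the simple models.

```r
# Adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n is the sample size and p the number of predictors
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 30  # inferred: 28 residual df + 2 coefficients in the simple models

adj_size  <- adj_r2(0.3594, n, p = 1)  # Price ~ Size
adj_joint <- adj_r2(0.3912, n, p = 3)  # Price ~ Size + Beds + Baths

cat("Adjusted R^2, size only:", round(adj_size, 4), "\n")
cat("Adjusted R^2, joint    :", round(adj_joint, 4), "\n")
```

Because the inputs are rounded, the results can differ from the printed summaries in the fourth decimal place, but the ordering is the same: the joint model has the lower adjusted R^2.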

Q5. Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This will help you determine if the state in which a home is located has a significant impact on its price. All data should be used.

model_anova <- aov(Price ~ State, data = home)
summary(model_anova)
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## State         3 1198169  399390   7.355 0.000148 ***
## Residuals   116 6299266   54304                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model_anova)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Price ~ State, data = home)
## 
## $State
##             diff       lwr        upr     p adj
## NJ-CA -206.83333 -363.6729  -49.99379 0.0044754
## NY-CA -170.03333 -326.8729  -13.19379 0.0280402
## PA-CA -269.80000 -426.6395 -112.96045 0.0001011
## NY-NJ   36.80000 -120.0395  193.63955 0.9282064
## PA-NJ  -62.96667 -219.8062   93.87288 0.7224830
## PA-NY  -99.76667 -256.6062   57.07288 0.3505951

ANOVA Results: The F-test (F = 7.355 on 3 and 116 degrees of freedom, p = 0.000148) indicates that mean home prices differ significantly among the four states.
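The F value and p-value can be rebuilt from the sums of squares in the ANOVA table, which makes the logic of the test explicit. The sketch below does this and also computes eta squared, a common effect-size summary that is not part of the original output and is added here only as an illustration.

```r
# Sums of squares and degrees of freedom copied from summary(model_anova)
ss_state <- 1198169; df_state <- 3
ss_resid <- 6299266; df_resid <- 116

# Mean squares are sums of squares divided by their degrees of freedom
ms_state <- ss_state / df_state
ms_resid <- ss_resid / df_resid

# F statistic and its upper-tail p-value
f_value <- ms_state / ms_resid
p_value <- pf(f_value, df_state, df_resid, lower.tail = FALSE)

# Eta squared: share of total price variation attributable to State
eta_sq <- ss_state / (ss_state + ss_resid)

cat("F value    :", round(f_value, 3), "\n")
cat("p-value    :", signif(p_value, 3), "\n")
cat("Eta squared:", round(eta_sq, 3), "\n")
```

By this measure, state accounts for roughly 16% of the variation in price, so the difference is statistically clear but most variation remains within states.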

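Tukey’s post-hoc test shows:

- California has significantly higher mean prices than NJ (about $207,000 higher, p = 0.0045), NY (about $170,000 higher, p = 0.0280), and PA (about $270,000 higher, p = 0.0001).
- Differences between NJ, NY, and PA are not statistically significant (all adjusted p-values exceed 0.35).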
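The Tukey intervals above all have the same width, which is what a balanced design produces. The sketch below rebuilds the NJ-CA interval from the residual mean square, assuming 30 homes per state (an assumption inferred from the 120 observations and the identical interval widths, not stated in the output); variable names are illustrative.

```r
# Values copied from the ANOVA table and Tukey output above
mse         <- 54304      # residual mean square
df_resid    <- 116
k           <- 4          # number of states
diff_nj_ca  <- -206.83333 # NJ mean minus CA mean, in thousands of dollars

# Assumption: the 120 homes are split evenly, 30 per state
n_per_state <- 30

# Studentized range critical value for a 95% family-wise level
q_crit <- qtukey(0.95, nmeans = k, df = df_resid)

# With equal group sizes the half-width simplifies to q * sqrt(MSE / n)
half_width <- q_crit * sqrt(mse / n_per_state)

interval <- c(lwr = diff_nj_ca - half_width, upr = diff_nj_ca + half_width)
print(round(interval, 2))
```

Any pairwise difference larger in magnitude than this half-width (about 157 thousand dollars) is significant at the 5% family-wise level, which is why only the three comparisons involving CA are flagged.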

Conclusion: The state significantly impacts home prices. California homes are more expensive than those in NJ, NY, and PA.

4. Visualisations

Q1. Scatter plot of Effect of Size on Price (California Only) with regression line.

plot(california_homes$Size, california_homes$Price, 
     main = "Price vs. Size (California)", 
     xlab = "Size (sq. ft.)", ylab = "Price (in $1,000s)", pch = 19, col = "blue")
abline(lm(Price ~ Size, data = california_homes), col = "red", lwd = 2)

Q2. Scatter plot of Effect of Bedrooms on Price (California Only) with regression line.

plot(california_homes$Beds, california_homes$Price, 
     main = "Price vs. Bedrooms (California)", 
     xlab = "Number of Bedrooms", ylab = "Price (in $1,000s)", pch = 19, col = "blue")
abline(lm(Price ~ Beds, data = california_homes), col = "red", lwd = 2)

Q3. Scatter plot of Effect of Bathrooms on Price (California Only) with regression line.

plot(california_homes$Baths, california_homes$Price, 
     main = "Price vs. Bathrooms (California)", 
     xlab = "Number of Bathrooms", ylab = "Price (in $1,000s)", pch = 19, col = "blue")
abline(lm(Price ~ Baths, data = california_homes), col = "red", lwd = 2)

Q5. Boxplot comparing prices across states:

boxplot(Price ~ State, data = home, 
        main = "Boxplot of Home Prices by State", 
        xlab = "State", ylab = "Price (in $1,000s)", 
        col = c("lightblue", "lightgreen", "pink", "yellow"))

5. Summary

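The analysis shows that, among the attributes studied, home price in California is most strongly associated with size. The number of bathrooms is a significant predictor on its own but not after accounting for size, and the number of bedrooms is not significant in either model. State-wise comparisons indicate that homes in California are, on average, significantly more expensive than those in NY, NJ, or PA, while the other three states do not differ significantly from one another. These results are based on 30 homes per regression and 120 homes overall, so they offer an indicative rather than definitive picture for buyers, sellers, and policymakers interested in the factors behind home prices and regional disparities.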

6. Reference

Lock5 Statistics. Homes for Sale Data (Version 3e), Lock5Stat, 2024, www.lock5stat.com/datasets3e/HomesForSale.csv.