This report explores the relationships between home prices and various attributes, including size, number of bedrooms, and number of bathrooms, using data from the “HomesForSale” dataset provided by Lock5Stat. Specifically, the analysis focuses on homes in California to determine how individual and combined factors influence price.
Additionally, the report evaluates whether the state in which a home is located significantly impacts its price by comparing homes across California (CA), New York (NY), New Jersey (NJ), and Pennsylvania (PA). The study uses regression models and ANOVA techniques to assess these relationships, offering insights into how key variables and state differences contribute to home pricing trends.
The following questions are explored in this report:
Q1. Use the data only for California. How much does the size of a home influence its price?
Q2. Use the data only for California. How does the number of bedrooms of a home influence its price?
Q3. Use the data only for California. How does the number of bathrooms of a home influence its price?
Q4. Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?
Q5. Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This will help you determine if the state in which a home is located has a significant impact on its price. All data should be used.
The data for this analysis was sourced from the Lock5Stat website: HomesForSale.csv. The dataset includes the following variables:
- Price: The price of the home (in thousands of dollars).
- Size: The size of the home (in square feet).
- Beds: The number of bedrooms in the home.
- Baths: The number of bathrooms in the home.
- State: The state in which the home is located (CA, NY, NJ, PA).
For the analysis:
- Q1–Q4: Data was filtered to include only homes in California.
- Q5: All data across the four states was used to compare prices.
Statistical models, including simple linear regression, multiple regression, and ANOVA, were employed to analyze the data and interpret the relationships between variables.
# Load the dataset
home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
# Inspect the dataset to understand its structure
str(home)
## 'data.frame': 120 obs. of 5 variables:
## $ State: chr "CA" "CA" "CA" "CA" ...
## $ Price: int 533 610 899 929 210 268 1095 699 729 700 ...
## $ Size : int 1589 2008 2380 1868 1360 2131 2436 1375 2013 1371 ...
## $ Beds : int 3 3 5 3 2 3 3 2 3 3 ...
## $ Baths: num 2.5 2 3 3 2 2 2 1 4 2 ...
The five questions are explored in detail below.
# Preview the first six rows (the data was already loaded into `home` above)
head(home)
## State Price Size Beds Baths
## 1 CA 533 1589 3 2.5
## 2 CA 610 2008 3 2.0
## 3 CA 899 2380 5 3.0
## 4 CA 929 1868 3 3.0
## 5 CA 210 1360 2 2.0
## 6 CA 268 2131 3 2.0
california_homes <- subset(home, State == "CA")
model_size <- lm(Price ~ Size, data = california_homes)
summary(model_size)
##
## Call:
## lm(formula = Price ~ Size, data = california_homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462.55 -139.69 39.24 147.65 352.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -56.81675 154.68102 -0.367 0.716145
## Size 0.33919 0.08558 3.963 0.000463 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared: 0.3594, Adjusted R-squared: 0.3365
## F-statistic: 15.71 on 1 and 28 DF, p-value: 0.0004634
Model Equation: Price = −56.82 + 0.3392 ⋅ Size
Interpretation:
- Because Price is recorded in thousands of dollars, each additional square foot of home size is associated with an increase in price of approximately 0.34 thousand dollars, or about $339.
- The slope is statistically significant (p = 0.000463), so the data provide strong evidence of a positive linear association between size and price.
- The model explains 35.94% of the variability in prices (R^2 = 0.3594), indicating that while size is important, other factors also influence price.
Conclusion: Home size is significantly associated with price, with larger homes tending to be more expensive.
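The fitted equation can be turned into a concrete prediction and a confidence interval for the slope. The following is a minimal sketch that uses only the coefficient, standard error, and degrees of freedom printed in the output above. The 2,000 square foot home is a hypothetical value chosen purely for illustration.

```r
# Values copied from the summary(model_size) output above
intercept <- -56.81675
slope     <- 0.33919   # thousands of dollars per square foot
se_slope  <- 0.08558
df_resid  <- 28

# Hypothetical home size, chosen only for illustration
size_sqft <- 2000
predicted_price <- intercept + slope * size_sqft  # in thousands of dollars

# 95% confidence interval for the slope: estimate +/- t* x SE
t_crit   <- qt(0.975, df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * se_slope

predicted_price         # about 621.6, i.e. roughly $621,600
round(slope_ci * 1000)  # dollars per square foot: 164 514
```

Under these assumptions, the interval suggests that each extra square foot is associated with roughly $164 to $514 of additional price, which is a wide range for a sample of 30 homes.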
california_homes <- subset(home, State == "CA")
model_beds <- lm(Price ~ Beds, data = california_homes)
summary(model_beds)
##
## Call:
## lm(formula = Price ~ Beds, data = california_homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -413.83 -236.62 29.94 197.69 570.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 269.76 233.62 1.155 0.258
## Beds 84.77 72.91 1.163 0.255
##
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared: 0.04605, Adjusted R-squared: 0.01198
## F-statistic: 1.352 on 1 and 28 DF, p-value: 0.2548
Model Equation: Price = 269.76 + 84.77 ⋅ Beds
Interpretation:
- The slope (84.77) suggests that each additional bedroom is associated with a price increase of approximately 84.77 thousand dollars (about $84,770), but the result is not statistically significant (p = 0.255).
- The R^2 = 0.046 shows that only 4.6% of the variation in price is explained by the number of bedrooms.
Conclusion: The data do not provide evidence of a linear relationship between the number of bedrooms and home price in California.
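To see where the non-significant result comes from, the t statistic and p-value can be rebuilt from the reported estimate and standard error. This is a sketch based only on the numbers printed above, so the results match the output up to rounding.

```r
# Values copied from the summary(model_beds) output above
estimate <- 84.77   # thousands of dollars per bedroom
se       <- 72.91
df_resid <- 28

# t statistic: how many standard errors the estimate lies from zero
t_stat <- estimate / se

# Two-sided p-value from the t distribution with 28 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df_resid, lower.tail = FALSE)

# 95% confidence interval for the slope
ci <- estimate + c(-1, 1) * qt(0.975, df_resid) * se

round(t_stat, 3)  # 1.163
round(p_value, 3) # about 0.255
round(ci, 1)      # interval spans zero
```

The confidence interval contains zero, which is another way of stating that the sample cannot distinguish the bedroom slope from no effect at all.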
california_homes <- subset(home, State == "CA")
model_baths <- lm(Price ~ Baths, data = california_homes)
summary(model_baths)
##
## Call:
## lm(formula = Price ~ Baths, data = california_homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -374.93 -181.56 -2.74 152.31 614.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.71 148.57 0.611 0.54641
## Baths 194.74 62.28 3.127 0.00409 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared: 0.2588, Adjusted R-squared: 0.2324
## F-statistic: 9.779 on 1 and 28 DF, p-value: 0.004092
Model Equation: Price = 90.71 + 194.74 ⋅ Baths
Interpretation:
- Each additional bathroom is associated with a price increase of approximately 194.74 thousand dollars (about $194,740), and this relationship is statistically significant (p = 0.00409).
- The R^2 = 0.2588 indicates that 25.88% of the price variability is explained by the number of bathrooms.
Conclusion: The number of bathrooms is significantly associated with home price, with more bathrooms linked to higher prices.
california_homes <- subset(home, State == "CA")
model_joint <- lm(Price ~ Size + Beds + Baths, data = california_homes)
summary(model_joint)
##
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = california_homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -415.47 -130.32 19.64 154.79 384.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41.5608 210.3809 -0.198 0.8449
## Size 0.2811 0.1189 2.364 0.0259 *
## Beds -33.7036 67.9255 -0.496 0.6239
## Baths 83.9844 76.7530 1.094 0.2839
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared: 0.3912, Adjusted R-squared: 0.3209
## F-statistic: 5.568 on 3 and 26 DF, p-value: 0.004353
Model Equation: Price = −41.56 + 0.2811 ⋅ Size − 33.70 ⋅ Beds + 83.98 ⋅ Baths
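Interpretation:
- Size is the only significant predictor (p = 0.0259), while bedrooms (p = 0.6239) and bathrooms (p = 0.2839) are not significant once size is in the model.
- The model explains 39.12% of the variability in price (R^2 = 0.3912). This is higher than any individual model, but R^2 can never decrease when predictors are added. The adjusted R^2 (0.3209) is slightly lower than that of the size-only model (0.3365), so adding bedrooms and bathrooms contributes little beyond size.
Conclusion: When considering size, bedrooms, and bathrooms together, size remains the strongest and only statistically significant predictor of price.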
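Whether bedrooms and bathrooms add anything beyond size can be checked with a partial F-test comparing the size-only model with the joint model. The sketch below approximates that test from the rounded R^2 values reported in the two outputs; with the data loaded, `anova(model_size, model_joint)` would give the exact version.

```r
# R-squared values copied from the two model summaries above
r2_size  <- 0.3594  # Price ~ Size
r2_joint <- 0.3912  # Price ~ Size + Beds + Baths

n              <- 30  # California homes (28 residual df + 2 coefficients)
q              <- 2   # predictors added to the size-only model: Beds, Baths
df_resid_joint <- 26  # residual df of the joint model

# Partial F statistic: gain in R-squared per added predictor,
# relative to the unexplained variation per residual df
f_stat  <- ((r2_joint - r2_size) / q) / ((1 - r2_joint) / df_resid_joint)
p_value <- pf(f_stat, q, df_resid_joint, lower.tail = FALSE)

# Adjusted R-squared penalises each additional predictor
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(f_stat, 2)                # about 0.68
round(p_value, 2)               # about 0.52
round(adj_r2(r2_size, n, 1), 4) # 0.3365, matches the size-only output
round(adj_r2(r2_joint, n, 3), 4)
```

A p-value of roughly 0.5 indicates that, given size, the two additional predictors do not significantly improve the fit for this sample.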
model_anova <- aov(Price ~ State, data = home)
summary(model_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## State 3 1198169 399390 7.355 0.000148 ***
## Residuals 116 6299266 54304
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model_anova)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Price ~ State, data = home)
##
## $State
## diff lwr upr p adj
## NJ-CA -206.83333 -363.6729 -49.99379 0.0044754
## NY-CA -170.03333 -326.8729 -13.19379 0.0280402
## PA-CA -269.80000 -426.6395 -112.96045 0.0001011
## NY-NJ 36.80000 -120.0395 193.63955 0.9282064
## PA-NJ -62.96667 -219.8062 93.87288 0.7224830
## PA-NY -99.76667 -256.6062 57.07288 0.3505951
ANOVA Results: The p-value (p = 0.000148) indicates that mean home prices differ significantly among the states.
Tukey’s post-hoc test shows:
- California has significantly higher mean prices than NJ (difference of about 206.8 thousand dollars, p = 0.0045), NY (about 170.0, p = 0.0280), and PA (about 269.8, p = 0.0001).
- Differences between NJ, NY, and PA are not statistically significant (all adjusted p-values above 0.35).
Conclusion: Home prices differ significantly by state. California homes are, on average, more expensive than those in NJ, NY, and PA.
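The ANOVA table also allows an effect-size calculation, and the width of the Tukey intervals can be reconstructed from it. The sketch below uses only the sums of squares and degrees of freedom printed above. The group size of 30 homes per state is an assumption inferred from the 120 observations and the identical widths of all six Tukey intervals.

```r
# Values copied from the summary(model_anova) output above
ss_state <- 1198169; df_state <- 3
ss_resid <- 6299266; df_resid <- 116

# Mean squares and the F statistic
ms_state <- ss_state / df_state
ms_resid <- ss_resid / df_resid
f_stat   <- ms_state / ms_resid
p_value  <- pf(f_stat, df_state, df_resid, lower.tail = FALSE)

# Eta-squared: share of total price variation attributable to state
eta_sq <- ss_state / (ss_state + ss_resid)

# Half-width of each 95% Tukey interval for equal group sizes
n_per_state  <- 30  # assumption: 120 homes split evenly over 4 states
tukey_margin <- qtukey(0.95, nmeans = 4, df = df_resid) / sqrt(2) *
  sqrt(ms_resid * 2 / n_per_state)

round(f_stat, 3)  # 7.355, matches the table
round(eta_sq, 3)  # 0.16
tukey_margin      # close to the 156.84 half-width in the Tukey output
```

Under these assumptions, state membership accounts for roughly 16% of the variation in price, so most of the variation lies within states rather than between them.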
# Scatterplot of price against size with the fitted regression line
plot(california_homes$Size, california_homes$Price,
     main = "Price vs. Size (California)",
     xlab = "Size (sq. ft.)", ylab = "Price (in $1,000s)", pch = 19, col = "blue")
abline(lm(Price ~ Size, data = california_homes), col = "red", lwd = 2)
# Scatterplot of price against bedrooms with the fitted regression line
plot(california_homes$Beds, california_homes$Price,
     main = "Price vs. Bedrooms (California)",
     xlab = "Number of Bedrooms", ylab = "Price (in $1,000s)", pch = 19, col = "blue")
abline(lm(Price ~ Beds, data = california_homes), col = "red", lwd = 2)
# Scatterplot of price against bathrooms with the fitted regression line
plot(california_homes$Baths, california_homes$Price,
     main = "Price vs. Bathrooms (California)",
     xlab = "Number of Bathrooms", ylab = "Price (in $1,000s)", pch = 19, col = "blue")
abline(lm(Price ~ Baths, data = california_homes), col = "red", lwd = 2)
# Side-by-side boxplots of price for the four states
boxplot(Price ~ State, data = home,
        main = "Boxplot of Home Prices by State",
        xlab = "State", ylab = "Price (in $1,000s)",
        col = c("lightblue", "lightgreen", "pink", "yellow"))
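The analysis indicates that home price in California is primarily associated with size. The number of bathrooms is a significant predictor on its own but not once size is accounted for, and the number of bedrooms is not a significant predictor in either case. State-level comparisons indicate that homes in California are, on average, significantly more expensive than those in NY, NJ, or PA, while prices in those three states do not differ significantly from one another. These findings are based on a sample of 30 homes per state and may be useful to buyers, sellers, and policymakers seeking to understand key factors affecting home prices and regional disparities.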
Lock5 Statistics. Homes for Sale Data (Version 3e), Lock5Stat, 2024, www.lock5stat.com/datasets3e/HomesForSale.csv.