Introduction

This project analyzes homes for sale in four U.S. states (CA, NY, NJ, and PA) and the physical attributes that affect their asking price, such as size, location, number of bedrooms, and number of bathrooms. The dataset provided contains 120 observations of 5 variables.

To analyze the data and learn from it, we ask the following 5 questions:

  1. Use the data only for California. How much does the size of a home influence its price?

  2. Use the data only for California. How does the number of bedrooms of a home influence its price?

  3. Use the data only for California. How does the number of bathrooms of a home influence its price?

  4. Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?

  5. Are there significant differences in home prices among the four states (CA, NY, NJ, PA)?

Data

Our data set comes from the file https://www.lock5stat.com/datasets3e/HomesForSale.csv , which is linked from the page https://www.lock5stat.com/datapage3e.html . It contains 120 observations and 5 variables (State, Price, Size, Beds, and Baths), all of which are used in this analysis.

Analysis

Our goal in this report is to understand how asking prices vary and which attributes affect them most significantly. For each question we fit a model, report the estimated effect and its statistical significance, and interpret the result in practical terms.

Q1: How much does the size of a home influence its price? (CA only)

## 
## Call:
## lm(formula = Price ~ Size, data = ca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462.55 -139.69   39.24  147.65  352.21 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -56.81675  154.68102  -0.367 0.716145    
## Size          0.33919    0.08558   3.963 0.000463 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared:  0.3594, Adjusted R-squared:  0.3365 
## F-statistic: 15.71 on 1 and 28 DF,  p-value: 0.0004634
##                    2.5 %      97.5 %
## (Intercept) -373.6664578 260.0329614
## Size           0.1638888   0.5144945

The estimated slope is 0.339, with a p-value of 0.00046. Price is measured in thousands of dollars. Size appears to be recorded in square feet rather than in thousands of square feet, since the fitted line reaches the mean CA price of about $535,000 only at a size of roughly 1,750. Under that reading, each additional square foot is associated with about $339 more in asking price, or about $339,000 per 1,000 square feet (95% confidence interval: roughly $164,000 to $514,000). The small p-value indicates that size has a statistically significant relationship with the price of a home, and the scatterplot with the fitted line shows the same upward trend. The \(R^2\) value of 0.359 means that about 36% of the variation in CA home prices is explained by size alone.
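The unit conversion behind the slope interpretation can be sketched as follows. This is a minimal illustration that only reuses numbers printed in the Q1 output and the CA row of the state summary; the variable names are made up for the example, and the reading of Size as square feet is an assumption suggested by the implied mean size.

``` r
# Values copied from the Q1 regression output and the CA group mean (Q5 summary).
intercept  <- -56.81675   # fitted Price at Size = 0, in $1000s
slope      <- 0.33919     # change in Price ($1000s) per one unit of Size
mean_price <- 535.3667    # mean CA asking price, in $1000s

# A least-squares line passes through (mean Size, mean Price),
# so the mean of Size can be recovered from the fitted coefficients.
mean_size <- (mean_price - intercept) / slope

# Assumption: one unit of Size is one square foot.
dollars_per_sqft      <- slope * 1000            # $1000s -> dollars
dollars_per_1000_sqft <- dollars_per_sqft * 1000

print(round(mean_size))
print(dollars_per_sqft)
print(dollars_per_1000_sqft)
```

A mean size in the neighborhood of 1,700 to 1,800 is plausible for square feet but not for thousands of square feet, which is why the slope is read here as dollars per square foot.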

Q2: How does the number of bedrooms of a home influence its price? (CA only)

## 
## Call:
## lm(formula = Price ~ Beds, data = ca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -413.83 -236.62   29.94  197.69  570.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   269.76     233.62   1.155    0.258
## Beds           84.77      72.91   1.163    0.255
## 
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared:  0.04605,    Adjusted R-squared:  0.01198 
## F-statistic: 1.352 on 1 and 28 DF,  p-value: 0.2548
##                  2.5 %   97.5 %
## (Intercept) -208.78172 748.3065
## Beds         -64.58336 234.1180

The estimated slope is 84.77, which suggests that the asking price rises by about $85,000 per additional bedroom. However, the p-value is 0.255 and the 95% confidence interval for the slope (-64.6 to 234.1) includes zero, so the number of bedrooms is not a statistically significant predictor of price in CA. It explains less than 5% of the variation in price. The plot also shows prices varying widely at each bedroom count. Any upward trend is more plausibly attributed to homes with more bedrooms having more total square footage, which the Question 1 analysis showed to be statistically significant.
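As a check on the reported significance, the t statistic and two-sided p-value for the Beds slope can be recomputed from the printed estimate and standard error. This is a small sketch using the rounded values from the output above, so the result matches the printed p-value only approximately; the variable names are illustrative.

``` r
# Values copied from the Q2 coefficient table.
estimate  <- 84.77   # slope for Beds, in $1000s per bedroom
std_error <- 72.91   # standard error of the slope
df_resid  <- 28      # residual degrees of freedom (30 homes - 2 coefficients)

# t statistic: how many standard errors the estimate is from zero.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(t_value, 3))
print(round(p_value, 3))
```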

Q3: How does the number of bathrooms of a home influence its price? (CA only)

## 
## Call:
## lm(formula = Price ~ Baths, data = ca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -374.93 -181.56   -2.74  152.31  614.81 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    90.71     148.57   0.611  0.54641   
## Baths         194.74      62.28   3.127  0.00409 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared:  0.2588, Adjusted R-squared:  0.2324 
## F-statistic: 9.779 on 1 and 28 DF,  p-value: 0.004092
##                  2.5 %   97.5 %
## (Intercept) -213.62183 395.0466
## Baths         67.17425 322.3040

The estimated slope for this analysis is 194.74, so each additional bathroom is associated with an increase of about $195,000 in asking price on average. The p-value is 0.004, meaning the number of bathrooms is statistically significant when predicting the price of a home in CA, and it accounts for about 26% of the variance in price. This is notable because the number of bedrooms was not statistically significant, even though most homes have a number of bathrooms roughly proportional to the number of bedrooms.
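The 95% confidence interval that `confint()` reports for the Baths slope can be reproduced by hand from the estimate and its standard error. The sketch below uses the rounded values printed above, so it agrees with the printed interval only to about one decimal place; the variable names are illustrative.

``` r
# Values copied from the Q3 coefficient table.
estimate  <- 194.74   # slope for Baths, in $1000s per bathroom
std_error <- 62.28    # standard error of the slope
df_resid  <- 28       # residual degrees of freedom

# Critical value of the t distribution for a 95% two-sided interval.
t_crit <- qt(0.975, df = df_resid)

# Interval: estimate plus or minus t_crit standard errors.
lower <- estimate - t_crit * std_error
upper <- estimate + t_crit * std_error

print(round(c(lower = lower, upper = upper), 1))
```

Because the whole interval lies above zero, it leads to the same conclusion as the p-value: the slope is significantly different from zero at the 5% level.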

Q4: How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price? (CA only)

## 
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = ca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -415.47 -130.32   19.64  154.79  384.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -41.5608   210.3809  -0.198   0.8449  
## Size          0.2811     0.1189   2.364   0.0259 *
## Beds        -33.7036    67.9255  -0.496   0.6239  
## Baths        83.9844    76.7530   1.094   0.2839  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared:  0.3912, Adjusted R-squared:  0.3209 
## F-statistic: 5.568 on 3 and 26 DF,  p-value: 0.004353
##                     2.5 %      97.5 %
## (Intercept) -474.00498463 390.8832902
## Size           0.03663397   0.5255712
## Beds        -173.32645947 105.9193266
## Baths        -73.78359026 241.7524094

## Loading required package: carData
##     Size     Beds    Baths 
## 1.886944 1.262775 1.717079

Once size is accounted for, neither the number of bedrooms (p = 0.62) nor the number of bathrooms (p = 0.28) is a statistically significant predictor of price. Size is the only predictor that remains significant (p = 0.026). The full model explains about 39% of the variance in CA home prices, only slightly more than the 36% explained by size alone, and its adjusted \(R^2\) (0.32) is lower than that of the size-only model (0.34). The variance inflation factors are all below 2, so multicollinearity is mild, but the predictors still overlap enough to account for the results in Questions 2 and 3: the bathroom effect that looked significant on its own largely reflects the fact that homes with more bathrooms tend to be larger.
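One way to quantify whether Beds and Baths add anything beyond Size is a partial F-test on the two nested models. This test is not part of the original analysis; the sketch below is an illustration that uses only the \(R^2\) values printed in the Q1 and Q4 output, with made-up variable names.

``` r
# Values copied from the Q1 (size-only) and Q4 (full model) output.
r2_reduced    <- 0.3594   # Price ~ Size
r2_full       <- 0.3912   # Price ~ Size + Beds + Baths
extra_terms   <- 2        # Beds and Baths
df_resid_full <- 26       # residual degrees of freedom of the full model

# Partial F statistic: gain in R^2 per added term,
# relative to the unexplained variation per residual degree of freedom.
f_stat <- ((r2_full - r2_reduced) / extra_terms) /
  ((1 - r2_full) / df_resid_full)

# Upper-tail probability under the F distribution.
p_value <- pf(f_stat, extra_terms, df_resid_full, lower.tail = FALSE)

print(round(f_stat, 3))
print(round(p_value, 3))
```

With the fitted model objects available, `anova(fit_size, fit_multi)` performs the same comparison directly. An F statistic below 1 with a p-value well above 0.05 is consistent with the conclusion that bedrooms and bathrooms contribute little once size is in the model.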

Q5: Are there significant differences in home prices among the four states (CA, NY, NJ, PA)?

##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## State         3 1198169  399390   7.355 0.000148 ***
## Residuals   116 6299266   54304                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Price ~ State, data = home)
## 
## $State
##             diff       lwr        upr     p adj
## NJ-CA -206.83333 -363.6729  -49.99379 0.0044754
## NY-CA -170.03333 -326.8729  -13.19379 0.0280402
## PA-CA -269.80000 -426.6395 -112.96045 0.0001011
## NY-NJ   36.80000 -120.0395  193.63955 0.9282064
## PA-NJ  -62.96667 -219.8062   93.87288 0.7224830
## PA-NY  -99.76667 -256.6062   57.07288 0.3505951
##   State Price.mean Price.sd  Price.n
## 1    CA   535.3667 269.1774  30.0000
## 2    NJ   328.5333 157.9731  30.0000
## 3    NY   365.3333 317.8217  30.0000
## 4    PA   265.5667 137.0894  30.0000

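We can see that the p-value for State is 0.000148, which is much lower than 0.05, so mean asking prices differ significantly among the four states. The Tukey comparisons show that the difference is driven by California: CA prices are significantly higher than those in NJ, NY, and PA (adjusted p-values all below 0.05), while no pair among NJ, NY, and PA differs significantly. Calculating the \(R^2\) value from the ANOVA table gives about 0.16, meaning that the state a home is located in explains about 16% of the variation in home prices. Additionally, the box plot shows the price range and typical price of homes in each of the four states within the data set.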
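The \(R^2\) value quoted above is not printed by `summary()` for an `aov` fit, so it has to be computed from the sums of squares. A minimal sketch using the numbers in the ANOVA table (variable names are illustrative):

``` r
# Values copied from the Q5 ANOVA table.
ss_state <- 1198169   # sum of squares between states
ss_resid <- 6299266   # residual (within-state) sum of squares
df_state <- 3
df_resid <- 116

# Share of total variation in Price explained by State.
r_squared <- ss_state / (ss_state + ss_resid)

# The F statistic is the ratio of the two mean squares.
f_value <- (ss_state / df_state) / (ss_resid / df_resid)

print(round(r_squared, 3))
print(round(f_value, 3))
```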

Summary

This project analyzed how various physical characteristics of homes influence asking prices, with a specific focus on homes located in California and a broader comparison across four U.S. states (CA, NY, NJ, and PA). Regression and ANOVA methods were used to evaluate the strength and statistical significance of these relationships.

Key Findings

  • Home size was the most influential factor in determining price for California homes. The regression model showed that for every additional 1,000 square feet, the expected home price increased by approximately $339,000. Size alone explained about 36% of the variability in California home prices.
  • The number of bedrooms was not a statistically significant predictor of price in California. Although homes with more bedrooms tended to have higher prices, the p-value indicated that bedrooms do not independently explain price differences.
  • The number of bathrooms was statistically significant when analyzed individually. Each additional bathroom was associated with an increase of approximately $195,000 in expected price, and the number of bathrooms explained about 26% of the variation in price.
  • In the multiple regression model, which included size, number of bedrooms, and number of bathrooms, only size remained statistically significant. This suggests that the apparent influence of bedrooms and bathrooms is largely explained through their relationship with overall square footage.
  • The ANOVA comparing home prices across states revealed statistically significant differences among CA, NY, NJ, and PA (p = 0.000148). The resulting \(R^2\) value of approximately 0.16 indicates that state location explains about 16% of the total variation in home prices.

Overall Conclusion

Overall, the analysis indicates that home size is the primary driver of housing prices among the attributes studied, particularly within California. While features such as bedrooms and bathrooms may initially appear to influence price, their effects diminish once size is accounted for. Additionally, geographic location has a statistically significant impact on home prices, though it explains only a modest portion (about 16%) of overall variability. These findings emphasize the importance of considering multiple variables together when evaluating real estate prices.

References

D2L Assignment (Instruction set) – Dr. Zhang’s Project Video (Set up reference)

https://www.lock5stat.com/datapage3e.html (All data page) – https://www.lock5stat.com/datasets3e/HomesForSale.csv (home data)

ChatGPT (Code troubleshooting)

Appendix

Full Analysis Code


``` r
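# Load the data (URL from the Data section)
home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")

# Q1
# Keep only the California homes
ca <- subset(home, State == "CA")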
# Fit simple linear regression
fit_size <- lm(Price ~ Size, data = ca)
summary(fit_size)
## 
## Call:
## lm(formula = Price ~ Size, data = ca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462.55 -139.69   39.24  147.65  352.21 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -56.81675  154.68102  -0.367 0.716145    
## Size          0.33919    0.08558   3.963 0.000463 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared:  0.3594, Adjusted R-squared:  0.3365 
## F-statistic: 15.71 on 1 and 28 DF,  p-value: 0.0004634
confint(fit_size)
##                    2.5 %      97.5 %
## (Intercept) -373.6664578 260.0329614
## Size           0.1638888   0.5144945
# Plot with regression line
plot(ca$Size, ca$Price, xlab="Size (sq ft)", ylab="Price ($1000s)",
     main="CA: Price vs Size")
abline(fit_size)

# Q2
fit_beds <- lm(Price ~ Beds, data = ca)
summary(fit_beds)
## 
## Call:
## lm(formula = Price ~ Beds, data = ca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -413.83 -236.62   29.94  197.69  570.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   269.76     233.62   1.155    0.258
## Beds           84.77      72.91   1.163    0.255
## 
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared:  0.04605,    Adjusted R-squared:  0.01198 
## F-statistic: 1.352 on 1 and 28 DF,  p-value: 0.2548
confint(fit_beds)
##                  2.5 %   97.5 %
## (Intercept) -208.78172 748.3065
## Beds         -64.58336 234.1180
# Plot
plot(ca$Beds, ca$Price, xlab="Beds", ylab="Price ($1000s)",
     main="CA: Price vs Beds")
abline(fit_beds)

# Q3
fit_baths <- lm(Price ~ Baths, data = ca)
summary(fit_baths)
## 
## Call:
## lm(formula = Price ~ Baths, data = ca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -374.93 -181.56   -2.74  152.31  614.81 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    90.71     148.57   0.611  0.54641   
## Baths         194.74      62.28   3.127  0.00409 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared:  0.2588, Adjusted R-squared:  0.2324 
## F-statistic: 9.779 on 1 and 28 DF,  p-value: 0.004092
confint(fit_baths)
##                  2.5 %   97.5 %
## (Intercept) -213.62183 395.0466
## Baths         67.17425 322.3040
# Plot
plot(ca$Baths, ca$Price, xlab="Baths", ylab="Price ($1000s)",
     main="CA: Price vs Baths")
abline(fit_baths)

# Q4
fit_multi <- lm(Price ~ Size + Beds + Baths, data = ca)
summary(fit_multi)
## 
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = ca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -415.47 -130.32   19.64  154.79  384.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -41.5608   210.3809  -0.198   0.8449  
## Size          0.2811     0.1189   2.364   0.0259 *
## Beds        -33.7036    67.9255  -0.496   0.6239  
## Baths        83.9844    76.7530   1.094   0.2839  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared:  0.3912, Adjusted R-squared:  0.3209 
## F-statistic: 5.568 on 3 and 26 DF,  p-value: 0.004353
confint(fit_multi)
##                     2.5 %      97.5 %
## (Intercept) -474.00498463 390.8832902
## Size           0.03663397   0.5255712
## Beds        -173.32645947 105.9193266
## Baths        -73.78359026 241.7524094
# Diagnostics for multiple regression
par(mfrow=c(2,2))
plot(fit_multi)  # residuals, QQ-plot, scale-location, leverage

# Check multicollinearity (optional)
# install.packages("car") # if needed
library(car)
vif(fit_multi)
##     Size     Beds    Baths 
## 1.886944 1.262775 1.717079
# Q5
fit_state <- aov(Price ~ State, data = home)
summary(fit_state)
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## State         3 1198169  399390   7.355 0.000148 ***
## Residuals   116 6299266   54304                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# If overall ANOVA is significant, run pairwise comparisons (Tukey)
TukeyHSD(fit_state)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Price ~ State, data = home)
## 
## $State
##             diff       lwr        upr     p adj
## NJ-CA -206.83333 -363.6729  -49.99379 0.0044754
## NY-CA -170.03333 -326.8729  -13.19379 0.0280402
## PA-CA -269.80000 -426.6395 -112.96045 0.0001011
## NY-NJ   36.80000 -120.0395  193.63955 0.9282064
## PA-NJ  -62.96667 -219.8062   93.87288 0.7224830
## PA-NY  -99.76667 -256.6062   57.07288 0.3505951
# Check group means and boxplot
aggregate(Price ~ State, data = home, FUN = function(x) c(mean=mean(x), sd=sd(x), n=length(x)))
##   State Price.mean Price.sd  Price.n
## 1    CA   535.3667 269.1774  30.0000
## 2    NJ   328.5333 157.9731  30.0000
## 3    NY   365.3333 317.8217  30.0000
## 4    PA   265.5667 137.0894  30.0000
boxplot(Price ~ State, data = home, main="Price by State", ylab="Price ($1000s)")