Introduction

This project analyzes real estate data to understand how home features and location relate to asking price. We first focus on California to examine how size, number of bedrooms, and number of bathrooms relate to price. We then test whether mean price differs significantly among four states (CA, NY, NJ, PA).

Data

home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
head(home)
##   State Price Size Beds Baths
## 1    CA   533 1589    3   2.5
## 2    CA   610 2008    3   2.0
## 3    CA   899 2380    5   3.0
## 4    CA   929 1868    3   3.0
## 5    CA   210 1360    2   2.0
## 6    CA   268 2131    3   2.0
summary(home)
##     State               Price             Size           Beds      
##  Length:120         Min.   :  35.0   Min.   : 540   Min.   :1.000  
##  Class :character   1st Qu.: 193.8   1st Qu.:1429   1st Qu.:3.000  
##  Mode  :character   Median : 287.0   Median :1737   Median :3.000  
##                     Mean   : 373.7   Mean   :1863   Mean   :3.358  
##                     3rd Qu.: 519.5   3rd Qu.:2094   3rd Qu.:4.000  
##                     Max.   :1250.0   Max.   :4286   Max.   :6.000  
##      Baths    
##  Min.   :1.0  
##  1st Qu.:2.0  
##  Median :2.0  
##  Mean   :2.4  
##  3rd Qu.:3.0  
##  Max.   :5.0

The dataset includes 120 homes across CA, NY, NJ, and PA with the following variables:

State: Home location

Price: Asking price (in $1,000s)

Size: Area (in sq. ft.)

Beds: Number of bedrooms

Baths: Number of bathrooms
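
Before modelling, it can help to confirm that the file matches this description. The following is a minimal sketch in base R. It re-reads the file from the same URL used above, and the object name counts is introduced here only for illustration.

```r
# Re-read the data from the URL used in this report
home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")

# Number of homes listed for each state
counts <- table(home$State)
print(counts)

# Missing values per column (all zeros would mean the data are complete)
print(colSums(is.na(home)))
```

If the counts are unequal across states, that is worth keeping in mind when reading the ANOVA in question 5.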

Analysis

We address five questions in turn. Questions 1 to 4 use only the California homes; question 5 uses all 120 homes.

(1) Use the data only for California. How much does the size of a home influence its price?

ca <- subset(home, State == "CA")
model1 <- lm(Price ~ Size, data = ca)
summary(model1)
## 
## Call:
## lm(formula = Price ~ Size, data = ca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462.55 -139.69   39.24  147.65  352.21 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -56.81675  154.68102  -0.367 0.716145    
## Size          0.33919    0.08558   3.963 0.000463 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared:  0.3594, Adjusted R-squared:  0.3365 
## F-statistic: 15.71 on 1 and 28 DF,  p-value: 0.0004634
plot(ca$Size, ca$Price, main="Price vs Size (CA)", xlab="Size (sq ft)", ylab="Price ($1000s)", col="blue")
abline(model1, col="red")

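As homes get larger, the asking price tends to rise, and the fitted line in the plot shows this upward trend. The slope is 0.339, so each additional square foot is associated with about $339 more in asking price (Price is in $1,000s). The slope is statistically significant (p = 0.00046), but R-squared is 0.36, so size alone explains only about 36% of the variation in California prices.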
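
To show how the fitted line can be used, the sketch below computes a confidence interval for the slope and a prediction for one home. It is an illustration rather than part of the original analysis: the 2,000 sq ft home and the names new_home and pred are assumptions made for the example.

```r
# Rebuild the California subset and the Size-only model from question 1
home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
ca <- subset(home, State == "CA")
model1 <- lm(Price ~ Size, data = ca)

# 95% confidence intervals for the intercept and the Size slope
print(confint(model1, level = 0.95))

# Predicted asking price (in $1,000s) for a hypothetical 2,000 sq ft home,
# with a 95% prediction interval for a single listing
new_home <- data.frame(Size = 2000)
pred <- predict(model1, newdata = new_home, interval = "prediction")
print(pred)
```

Using the coefficients printed above, the point prediction is -56.82 + 0.3392 x 2000, or about 621.6, which is roughly $622,000. The prediction interval around it is wide because the residual standard error is large (219.3).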

(2) Use the data only for California. How does the number of bedrooms of a home influence its price?

model2 <- lm(Price ~ Beds, data = ca)
summary(model2)
## 
## Call:
## lm(formula = Price ~ Beds, data = ca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -413.83 -236.62   29.94  197.69  570.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   269.76     233.62   1.155    0.258
## Beds           84.77      72.91   1.163    0.255
## 
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared:  0.04605,    Adjusted R-squared:  0.01198 
## F-statistic: 1.352 on 1 and 28 DF,  p-value: 0.2548
plot(ca$Beds, ca$Price, main="Price vs Bedrooms (CA)", xlab="Number of Bedrooms", ylab="Price ($1000s)", col="darkgreen")
abline(model2, col="red")

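The points are widely scattered, especially for 3-bedroom homes. The fitted line rises slightly (about $84,800 per additional bedroom), but the slope is not statistically significant (p = 0.255) and R-squared is only 0.046. In this sample, the number of bedrooms alone does not have a clear relationship with price.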
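
Because Beds takes only a few distinct values, a grouped summary can complement the scatterplot. The sketch below is an optional check, not part of the original analysis; the name by_beds is introduced for illustration.

```r
# Rebuild the California subset used in questions 1 to 4
home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
ca <- subset(home, State == "CA")

# Number of California listings at each bedroom count
print(table(ca$Beds))

# Mean and standard deviation of price (in $1,000s) by bedroom count;
# sd is NA for any bedroom count that has only one listing
by_beds <- aggregate(Price ~ Beds, data = ca,
                     FUN = function(p) c(mean = mean(p), sd = sd(p)))
print(by_beds)
```

A large standard deviation within a bedroom count would be consistent with the wide vertical spread seen in the plot.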

(3) Use the data only for California. How does the number of bathrooms of a home influence its price?

model3 <- lm(Price ~ Baths, data = ca)
summary(model3)
## 
## Call:
## lm(formula = Price ~ Baths, data = ca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -374.93 -181.56   -2.74  152.31  614.81 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    90.71     148.57   0.611  0.54641   
## Baths         194.74      62.28   3.127  0.00409 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared:  0.2588, Adjusted R-squared:  0.2324 
## F-statistic: 9.779 on 1 and 28 DF,  p-value: 0.004092
plot(ca$Baths, ca$Price, main="Price vs Bathrooms (CA)", xlab="Number of Bathrooms", ylab="Price ($1000s)", col="purple")
abline(model3, col="red")

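Homes with more bathrooms tend to have higher asking prices, and the fitted line shows a clear upward trend. Each additional bathroom is associated with about $195,000 more in asking price, and the slope is statistically significant (p = 0.004). R-squared is 0.26, so bathrooms alone explain about 26% of the variation in price.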
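
The three single-predictor models can be compared side by side by their R-squared values. The sketch below refits them in one list; the names fits and r2 are introduced here for illustration.

```r
# Rebuild the California subset used in questions 1 to 3
home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
ca <- subset(home, State == "CA")

# Fit the three single-predictor models from questions 1 to 3
fits <- list(
  Size  = lm(Price ~ Size,  data = ca),
  Beds  = lm(Price ~ Beds,  data = ca),
  Baths = lm(Price ~ Baths, data = ca)
)

# Share of price variation explained by each predictor on its own
r2 <- sapply(fits, function(m) summary(m)$r.squared)
print(round(r2, 4))
```

The values should agree with the Multiple R-squared lines reported above (0.3594 for Size, 0.04605 for Beds, 0.2588 for Baths).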

(4) Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?

model4 <- lm(Price ~ Size + Beds + Baths, data = ca)
summary(model4)
## 
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = ca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -415.47 -130.32   19.64  154.79  384.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -41.5608   210.3809  -0.198   0.8449  
## Size          0.2811     0.1189   2.364   0.0259 *
## Beds        -33.7036    67.9255  -0.496   0.6239  
## Baths        83.9844    76.7530   1.094   0.2839  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared:  0.3912, Adjusted R-squared:  0.3209 
## F-statistic: 5.568 on 3 and 26 DF,  p-value: 0.004353
termplot(model4, se=TRUE, col.term="red", main="Effect of Predictors on Price")

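This plot shows the estimated contribution of each predictor when all three are in the model together. Only Size remains statistically significant (p = 0.026). Baths is no longer significant (p = 0.284), and Beds stays non-significant (p = 0.624) with a negative estimate. The joint model explains about 39% of the variation in price (R-squared = 0.3912), only slightly more than Size alone (0.3594), and its adjusted R-squared (0.3209) is lower than that of the Size-only model (0.3365).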
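
In the output above, Baths has p = 0.2839 in the joint model, although it was significant on its own in question 3 (p = 0.004). One possible explanation, which the report does not test, is that the predictors are correlated with each other. The sketch below checks this and compares the joint model with the Size-only model; the name cmp is introduced for illustration.

```r
# Rebuild the California subset used in question 4
home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
ca <- subset(home, State == "CA")

# Pairwise correlations among the three predictors; large values suggest
# that the predictors carry overlapping information
print(round(cor(ca[, c("Size", "Beds", "Baths")]), 2))

# Partial F-test: does adding Beds and Baths improve on Size alone?
model1 <- lm(Price ~ Size, data = ca)
model4 <- lm(Price ~ Size + Beds + Baths, data = ca)
cmp <- anova(model1, model4)
print(cmp)
```

A large p-value in the second row of the comparison would suggest that Beds and Baths add little once Size is already in the model.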

(5) Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This will help you determine if the state in which a home is located has a significant impact on its price. All data should be used.

model5 <- aov(Price ~ State, data = home)
summary(model5)
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## State         3 1198169  399390   7.355 0.000148 ***
## Residuals   116 6299266   54304                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boxplot(Price ~ State, data = home, main="Home Prices by State", ylab="Price ($1000s)", col=c("red", "green", "blue", "orange"))

The boxplot shows that homes in California cost the most, and homes in PA cost the least. The ANOVA result (F = 7.355, p = 0.000148) indicates that mean price differs significantly for at least one state; on its own it does not show which pairs of states differ.
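
The ANOVA F-test only says that the state means are not all equal. A common follow-up, not part of the original analysis, is Tukey's honest significant difference test for each pair of states. The sketch below uses base R's TukeyHSD(); the name tukey is introduced for illustration, and State is converted to a factor because TukeyHSD() expects a factor as the grouping variable.

```r
# Re-read the full data set (all four states)
home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")

# TukeyHSD() needs the grouping variable to be a factor
home$State <- factor(home$State)

# Mean asking price (in $1,000s) in each state
print(tapply(home$Price, home$State, mean))

# Tukey's honest significant differences for all six state pairs
model5 <- aov(Price ~ State, data = home)
tukey <- TukeyHSD(model5, conf.level = 0.95)
print(tukey)
```

Pairs whose adjusted p-value is below 0.05 (equivalently, whose interval excludes zero) are the ones with a significant difference in mean price.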

Summary

Our analysis has shown:

- Size has a statistically significant, positive relationship with price in California (p < 0.001) and explains about 36% of the variation in price.

- Number of bedrooms does not show a significant relationship with price (p = 0.255).

- Number of bathrooms is a significant predictor on its own (p = 0.004). More bathrooms generally go with higher prices.

- In the multiple regression, only size remains significant (p = 0.026); bathrooms (p = 0.284) and bedrooms (p = 0.624) do not.

- State of the home significantly affects price (ANOVA p < 0.001). California homes are the most expensive on average.

These results suggest that both physical features and location are associated with home value. Regression and ANOVA helped identify which factors are related to asking price in this sample.

References

Lock5Stat Dataset: HomesForSale (2019) https://www.lock5stat.com/datapage3e.html

Zillow.com reference data for 2019 home listings

Appendix

Q1: Price vs Size (CA)

ca <- subset(home, State == "CA")
model1 <- lm(Price ~ Size, data = ca)
summary(model1)
plot(ca$Size, ca$Price, main="Price vs Size (CA)", xlab="Size (sq ft)", ylab="Price ($1000s)", col="blue")
abline(model1, col="red")

Q2: Price vs Bedrooms (CA)

model2 <- lm(Price ~ Beds, data = ca)
summary(model2)
plot(ca$Beds, ca$Price, main="Price vs Bedrooms (CA)", xlab="Number of Bedrooms", ylab="Price ($1000s)", col="darkgreen")
abline(model2, col="red")

Q3: Price vs Bathrooms (CA)

model3 <- lm(Price ~ Baths, data = ca)
summary(model3)
plot(ca$Baths, ca$Price, main="Price vs Bathrooms (CA)", xlab="Number of Bathrooms", ylab="Price ($1000s)", col="purple")
abline(model3, col="red")

Q4: Multiple Regression (Size + Beds + Baths)

model4 <- lm(Price ~ Size + Beds + Baths, data = ca)
summary(model4)
termplot(model4, se=TRUE, col.term="red", main="Effect of Predictors on Price")

Q5: Price by State (ANOVA)

model5 <- aov(Price ~ State, data = home)
summary(model5)
boxplot(Price ~ State, data = home, main="Home Prices by State", ylab="Price ($1000s)", col=c("red", "green", "blue",