This project analyzes real estate data to understand how home features and location relate to house prices. Specifically, we focus on California to explore how size, number of bedrooms, and number of bathrooms affect price. We also test whether mean price differs significantly among four states (CA, NY, NJ, PA).
home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
head(home)
## State Price Size Beds Baths
## 1 CA 533 1589 3 2.5
## 2 CA 610 2008 3 2.0
## 3 CA 899 2380 5 3.0
## 4 CA 929 1868 3 3.0
## 5 CA 210 1360 2 2.0
## 6 CA 268 2131 3 2.0
summary(home)
## State Price Size Beds
## Length:120 Min. : 35.0 Min. : 540 Min. :1.000
## Class :character 1st Qu.: 193.8 1st Qu.:1429 1st Qu.:3.000
## Mode :character Median : 287.0 Median :1737 Median :3.000
## Mean : 373.7 Mean :1863 Mean :3.358
## 3rd Qu.: 519.5 3rd Qu.:2094 3rd Qu.:4.000
## Max. :1250.0 Max. :4286 Max. :6.000
## Baths
## Min. :1.0
## 1st Qu.:2.0
## Median :2.0
## Mean :2.4
## 3rd Qu.:3.0
## Max. :5.0
The dataset includes 120 homes across CA, NY, NJ, and PA with the following variables:
State: Home location
Price: Asking price (in $1,000s)
Size: Area (in sq. ft.; the values above range from 540 to 4286)
Beds: Number of bedrooms
Baths: Number of bathrooms
We will explore the questions in detail.
ca <- subset(home, State == "CA")
model1 <- lm(Price ~ Size, data = ca)
summary(model1)
##
## Call:
## lm(formula = Price ~ Size, data = ca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462.55 -139.69 39.24 147.65 352.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -56.81675 154.68102 -0.367 0.716145
## Size 0.33919 0.08558 3.963 0.000463 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared: 0.3594, Adjusted R-squared: 0.3365
## F-statistic: 15.71 on 1 and 28 DF, p-value: 0.0004634
plot(ca$Size, ca$Price, main="Price vs Size (CA)", xlab="Size (sq ft)", ylab="Price ($1000s)", col="blue")
abline(model1, col="red")
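As homes get bigger, the price tends to go up, and the fitted line in
the graph shows this trend. The slope is about 0.339, so each additional
square foot is associated with roughly $339 more in asking price (Price
is in $1,000s). The slope is statistically significant (p = 0.000463),
but size explains only about 36% of the variation in price
(R-squared = 0.3594).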
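To show how precisely the slope is estimated, here is a minimal sketch that rebuilds a 95% confidence interval from the estimate, standard error, and residual degrees of freedom printed in the summary(model1) output. The variable names (est, se, t_crit, ci) are illustrative and do not appear in the original analysis.

```r
# Values copied from the summary(model1) output above
est <- 0.33919   # slope estimate for Size
se  <- 0.08558   # standard error of the slope
df  <- 28        # residual degrees of freedom

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df)
ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 3))

# Price is in $1,000s, so multiply by 1000 for dollars per square foot
print(round(ci * 1000))
```

If the fitted object is still in the session, confint(model1, "Size") should give the same interval up to rounding.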
model2 <- lm(Price ~ Beds, data = ca)
summary(model2)
##
## Call:
## lm(formula = Price ~ Beds, data = ca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -413.83 -236.62 29.94 197.69 570.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 269.76 233.62 1.155 0.258
## Beds 84.77 72.91 1.163 0.255
##
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared: 0.04605, Adjusted R-squared: 0.01198
## F-statistic: 1.352 on 1 and 28 DF, p-value: 0.2548
plot(ca$Beds, ca$Price, main="Price vs Bedrooms (CA)", xlab="Number of Bedrooms", ylab="Price ($1000s)", col="darkgreen")
abline(model2, col="red")
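The points are widely scattered, especially for 3-bedroom homes. The
line rises slightly (about $84,770 per additional bedroom), but the
slope is not statistically significant (p = 0.255) and the model
explains under 5% of the variation in price (R-squared = 0.046). So, the
number of bedrooms alone has no clear relationship with price in this
sample.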
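To make the "not significant" reading concrete, this sketch recomputes the t statistic and two-sided p-value for the Beds slope from the printed estimate and standard error. It is arithmetic on the summary(model2) output, not a refit of the model, and the variable names are illustrative.

```r
# Values copied from the summary(model2) output above
est <- 84.77   # slope estimate for Beds
se  <- 72.91   # standard error of the slope
df  <- 28      # residual degrees of freedom

# t statistic: how many standard errors the estimate is from zero
t_stat <- est / se

# Two-sided p-value from the t distribution with 28 degrees of freedom
p_val <- 2 * pt(-abs(t_stat), df)

print(round(t_stat, 3))
print(round(p_val, 3))
```

A slope only about 1.16 standard errors from zero is well within what chance alone could produce, which is why the p-value is large.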
model3 <- lm(Price ~ Baths, data = ca)
summary(model3)
##
## Call:
## lm(formula = Price ~ Baths, data = ca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -374.93 -181.56 -2.74 152.31 614.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.71 148.57 0.611 0.54641
## Baths 194.74 62.28 3.127 0.00409 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared: 0.2588, Adjusted R-squared: 0.2324
## F-statistic: 9.779 on 1 and 28 DF, p-value: 0.004092
plot(ca$Baths, ca$Price, main="Price vs Bathrooms (CA)", xlab="Number of Bathrooms", ylab="Price ($1000s)", col="purple")
abline(model3, col="red")
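The more bathrooms a house has, the higher the price tends to be. The
line on the graph shows an upward trend: each additional bathroom is
associated with about $194,740 more in asking price. The slope is
statistically significant (p = 0.00409), and bathrooms explain about 26%
of the variation in price (R-squared = 0.2588).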
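As a worked example of what the bathroom model implies, the sketch below plugs the printed coefficients into the fitted line for homes with 2 and 3 bathrooms. The choice of 2 and 3 bathrooms and the variable names are illustrative assumptions.

```r
# Coefficients copied from the summary(model3) output above
b0 <- 90.71    # intercept
b1 <- 194.74   # slope for Baths

# Predicted asking price (in $1,000s) for 2-bath and 3-bath homes
baths <- c(2, 3)
pred <- b0 + b1 * baths
names(pred) <- paste(baths, "baths")
print(pred)

# The difference between the two predictions equals the slope
print(unname(diff(pred)))
```

With the fitted object available, predict(model3, newdata = data.frame(Baths = c(2, 3))) should return the same values up to rounding. The residual standard error of 235.8 (in $1,000s) is a reminder that individual homes can fall far from these predictions.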
model4 <- lm(Price ~ Size + Beds + Baths, data = ca)
summary(model4)
##
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = ca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -415.47 -130.32 19.64 154.79 384.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41.5608 210.3809 -0.198 0.8449
## Size 0.2811 0.1189 2.364 0.0259 *
## Beds -33.7036 67.9255 -0.496 0.6239
## Baths 83.9844 76.7530 1.094 0.2839
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared: 0.3912, Adjusted R-squared: 0.3209
## F-statistic: 5.568 on 3 and 26 DF, p-value: 0.004353
termplot(model4, se=TRUE, col.term="red", main="Effect of Predictors on Price")
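This graph shows the partial effect of each factor on price when we
consider them all at once. Only size remains statistically significant
(p = 0.0259). Bathrooms are no longer significant (p = 0.2839) once size
is in the model, and bedrooms remain non-significant (p = 0.6239), with
a negative estimated coefficient. The adjusted R-squared (0.3209) is
slightly lower than for the size-only model (0.3365), so adding bedrooms
and bathrooms does not improve on size alone. A likely reason is that
larger homes tend to have more bathrooms, so the two predictors overlap.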
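One way to check whether Beds and Baths add anything beyond Size is a partial F-test comparing the size-only model with the three-predictor model. The sketch below recomputes that test from the R-squared values and residual degrees of freedom printed in the two summaries. Because it uses rounded values, treat the result as approximate. The variable names are illustrative.

```r
# Values copied from the summary(model1) and summary(model4) outputs above
r2_small <- 0.3594; df_small <- 28   # Price ~ Size
r2_full  <- 0.3912; df_full  <- 26   # Price ~ Size + Beds + Baths

# Number of extra predictors in the full model
q <- df_small - df_full

# Partial F statistic: gain in R-squared per added predictor,
# relative to the unexplained variation per residual degree of freedom
f_stat <- ((r2_full - r2_small) / q) / ((1 - r2_full) / df_full)
p_val  <- pf(f_stat, q, df_full, lower.tail = FALSE)

print(round(f_stat, 3))
print(round(p_val, 3))
```

With both fitted objects in the session, anova(model1, model4) performs the same comparison without rounding. A large p-value here means the data give no evidence that bedrooms and bathrooms improve the fit once size is included.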
model5 <- aov(Price ~ State, data = home)
summary(model5)
## Df Sum Sq Mean Sq F value Pr(>F)
## State 3 1198169 399390 7.355 0.000148 ***
## Residuals 116 6299266 54304
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boxplot(Price ~ State, data = home, main="Home Prices by State", ylab="Price ($1000s)", col=c("red", "green", "blue", "orange"))
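The boxplot shows that homes in California tend to have the highest
prices and homes in PA the lowest. The ANOVA p-value (0.000148)
indicates that mean price differs significantly among the four states,
although the F-test alone does not identify which pairs of states
differ.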
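To show where the ANOVA F value and p-value come from, this sketch rebuilds them from the sums of squares in the table above and adds eta-squared, the share of price variation accounted for by state. Eta-squared is not part of the original output. It is computed here as an illustrative effect-size measure, and the variable names are assumptions.

```r
# Values copied from the summary(model5) table above
ss_state <- 1198169; df_state <- 3
ss_resid <- 6299266; df_resid <- 116

# F statistic: between-state mean square over within-state mean square
f_stat <- (ss_state / df_state) / (ss_resid / df_resid)
p_val  <- pf(f_stat, df_state, df_resid, lower.tail = FALSE)

# Eta-squared: proportion of total variation in Price explained by State
eta_sq <- ss_state / (ss_state + ss_resid)

print(round(f_stat, 3))
print(signif(p_val, 3))
print(round(eta_sq, 3))
```

To find which specific states differ, a follow-up such as TukeyHSD(model5) could be run on the fitted aov object, assuming it is still in the session.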
Our analysis has shown:
- Size has a significant, positive relationship with price, clearly visible in the trend line.
- Number of bedrooms does not show a significant relationship with price.
- Number of bathrooms is a significant predictor on its own. More bathrooms are generally associated with a higher price.
- In the multiple regression, only size remains significant; bathrooms lose significance once size is accounted for, and bedrooms remain weak.
- Mean price differs significantly by state. California homes appear the most expensive in the boxplot.
These results support the idea that both physical features and location are associated with home value. Regression and ANOVA helped identify which factors are significantly related to asking price in this sample.
Lock5Stat Dataset: HomesForSale (2019) https://www.lock5stat.com/datapage3e.html
Zillow.com reference data for 2019 home listings