This report focuses on analyzing the housing market using data from homes listed for sale across four states: California, New York, New Jersey, and Pennsylvania. The dataset, which includes 120 observations and five variables, was originally sourced from Zillow in 2019 and is hosted on Lock5Stat.com. Most of the analysis centers on California, with comparisons made to the other three states later in the report. Throughout this project, we apply various statistical methods learned in STAT 353: Statistical Methods I for Engineering, including simple and multiple linear regression, p-value interpretation, and one-way ANOVA. We also use diagnostic tools to check regression assumptions and conduct a post-hoc Tukey HSD test to better understand differences between states.
Use the data only for California. How much does the size of a home influence its price?
Use the data only for California. How does the number of bedrooms of a home influence its price?
Use the data only for California. How does the number of bathrooms of a home influence its price?
Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?
Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This will help you determine if the state in which a home is located has a significant impact on its price. All data should be used.
college = read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
head(college)
## State Price Size Beds Baths
## 1 CA 533 1589 3 2.5
## 2 CA 610 2008 3 2.0
## 3 CA 899 2380 5 3.0
## 4 CA 929 1868 3 3.0
## 5 CA 210 1360 2 2.0
## 6 CA 268 2131 3 2.0
We will explore the chosen questions for analysis in detail below:
# Load the dataset from the web
college <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
# Filter the data for California
college_CA <- subset(college, State == "CA")
# Fit a simple linear regression: Price ~ Size
model1 <- lm(Price ~ Size, data = college_CA)
# Show the regression summary
summary(model1)
# Plot the data and regression line
plot(college_CA$Size, college_CA$Price,
     main = "Home Price vs. Size in California",
     xlab = "Size (sq ft)", ylab = "Price ($1,000s)",  # Price is recorded in thousands of dollars
     pch = 19, col = "steelblue")
abline(model1, col = "darkred", lwd = 2)
Above, we used simple linear regression to examine how the size of a home (in square feet) influences its price for homes in California. The slope coefficient estimates the average change in price for each additional square foot of size. A small p-value (< 0.05) for the Size variable means the relationship is statistically significant, that is, the data provide evidence that the slope differs from zero. The scatterplot with the regression line shows a clear upward trend, supporting a positive linear relationship between size and price: larger homes generally cost more.
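To make the interpretation of the slope concrete, the sketch below refits the same kind of model on only the six California rows printed by head() earlier in this report. It is an illustration, not the report's model1, so its numbers will differ from the full-data results. It assumes Price is recorded in thousands of dollars (as the dollar figures in the summary imply) and Size in square feet; the names toy and fit are hypothetical.

```r
# First six California rows copied from the head() output in this report.
# Illustration only: the report's model1 uses all California homes.
toy <- data.frame(
  Price = c(533, 610, 899, 929, 210, 268),       # asking price, assumed $1,000s
  Size  = c(1589, 2008, 2380, 1868, 1360, 2131)  # square feet
)

fit <- lm(Price ~ Size, data = toy)

# The raw slope is in $1,000s per square foot
slope <- unname(coef(fit)["Size"])

# Convert to dollars per square foot and dollars per 1,000 square feet
dollars_per_sqft      <- slope * 1000
dollars_per_1000_sqft <- dollars_per_sqft * 1000

# p-value for the slope and a 95% confidence interval (in $1,000s per sq ft)
p_value <- coef(summary(fit))["Size", "Pr(>|t|)"]
ci      <- confint(fit, "Size", level = 0.95)

print(c(per_sqft = dollars_per_sqft, per_1000_sqft = dollars_per_1000_sqft))
print(p_value)
print(ci)
```

Applying the same extraction steps to model1 gives the slope, p-value, and interval for the full California sample.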
# Load necessary libraries
library(ggplot2)
library(car) # for QQ plot
library(lmtest) # for additional diagnostic tests
library(MASS) # for standardized residuals
# Diagnostic Plots (built-in in R)
par(mfrow = c(2, 2)) # 2x2 plot layout
plot(model1) # model1 was created in Q1
par(mfrow = c(1, 1)) # reset to default
# Optionally: Cook's Distance threshold and plot
influential_threshold <- 4 / length(model1$fitted.values)
cooksd <- cooks.distance(model1)
plot(cooksd, type = "h", main = "Cook's Distance",
ylab = "Cook's Distance", xlab = "Observation Index")
abline(h = influential_threshold, col = "red", lty = 2)
legend("topright", legend = paste("Threshold =", round(influential_threshold, 4)),
col = "red", lty = 2)
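The Cook's distance plot draws the threshold line but does not list which observations exceed it. A minimal sketch of that step follows, again using only the six head() rows as toy data (the names toy, fit, and flagged are hypothetical). The 4/n cutoff is a common rule of thumb rather than a formal test.

```r
# Toy data: the six California rows shown by head() in this report
toy <- data.frame(
  Price = c(533, 610, 899, 929, 210, 268),
  Size  = c(1589, 2008, 2380, 1868, 1360, 2131)
)

fit <- lm(Price ~ Size, data = toy)

# Cook's distance for every observation
cooksd <- cooks.distance(fit)

# Rule-of-thumb threshold: 4 divided by the number of observations
threshold <- 4 / nobs(fit)

# Indices of observations whose Cook's distance exceeds the threshold
flagged <- which(cooksd > threshold)

print(round(cooksd, 3))
print(flagged)
```

Replacing fit with model1 (or any later model) lists the potentially influential homes for that model; flagged rows deserve a closer look before any decision to exclude them.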
# Load the dataset from the web
college <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
# Filter for California homes only
college_CA <- subset(college, State == "CA")
# Fit a simple linear regression: Price ~ Beds
model2 <- lm(Price ~ Beds, data = college_CA)
# Show the regression summary
summary(model2)
# Scatterplot with regression line
plot(college_CA$Beds, college_CA$Price,
     main = "Home Price vs. Number of Bedrooms in California",
     xlab = "Number of Bedrooms", ylab = "Price ($1,000s)",  # Price is recorded in thousands of dollars
     pch = 19, col = "darkgreen")
abline(model2, col = "red", lwd = 2)
We used simple linear regression to investigate whether the number of bedrooms (Beds) influences home price in California. The regression output provides both the slope coefficient and its p-value. A p-value below 0.05 would indicate a statistically significant relationship between the number of bedrooms and price. The slope gives the average change in price for each additional bedroom; because this model has a single predictor, no other home features are held constant.
# Load necessary libraries
library(ggplot2)
library(car) # for QQ plot
library(lmtest) # for diagnostic tests
library(MASS) # for residual analysis
# Diagnostic Plots for model2 (Price ~ Beds)
par(mfrow = c(2, 2)) # Set up 2x2 plot layout
plot(model2) # Built-in diagnostic plots
par(mfrow = c(1, 1)) # Reset layout
# Optional: Cook’s Distance plot
influential_threshold <- 4 / length(model2$fitted.values)
cooksd <- cooks.distance(model2)
plot(cooksd, type = "h", main = "Cook's Distance for Model 2",
ylab = "Cook's Distance", xlab = "Observation Index")
abline(h = influential_threshold, col = "red", lty = 2)
legend("topright", legend = paste("Threshold =", round(influential_threshold, 4)),
col = "red", lty = 2)
# Load dataset from the web
college <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
# Filter California homes
college_CA <- subset(college, State == "CA")
# Fit linear regression: Price ~ Baths
model3 <- lm(Price ~ Baths, data = college_CA)
# Show regression summary
summary(model3)
# Plot Price vs Baths with regression line
plot(college_CA$Baths, college_CA$Price,
     main = "Home Price vs. Number of Bathrooms in California",
     xlab = "Number of Bathrooms", ylab = "Price ($1,000s)",  # Price is recorded in thousands of dollars
     pch = 19, col = "darkblue")
abline(model3, col = "orange", lwd = 2)
We modeled the relationship between the number of bathrooms and home price using simple linear regression. If the p-value for Baths is less than 0.05, it indicates a statistically significant relationship with price. For example, a slope of $30,000 would mean that each additional bathroom is associated with an increase of about $30,000 in average home price.
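The reading of the slope as "the change in average price per additional bathroom" can be checked directly: the difference between the predicted prices at two and three bathrooms equals the fitted slope. The sketch below uses only the six head() rows as toy data, so its slope is not the report's estimate; toy, fit, and pred are hypothetical names, and Price is assumed to be in thousands of dollars.

```r
# Toy data: the six California rows shown by head() in this report
toy <- data.frame(
  Price = c(533, 610, 899, 929, 210, 268),  # assumed $1,000s
  Baths = c(2.5, 2.0, 3.0, 3.0, 2.0, 2.0)
)

fit <- lm(Price ~ Baths, data = toy)

# Fitted slope: change in average price ($1,000s) per additional bathroom
slope <- unname(coef(fit)["Baths"])

# Predicted average prices for homes with 2 and 3 bathrooms
pred <- predict(fit, newdata = data.frame(Baths = c(2, 3)))

# The gap between the two predictions is exactly the slope
diff_pred <- unname(pred[2] - pred[1])

print(c(slope = slope, difference = diff_pred))
print(slope * 1000)  # the same quantity expressed in dollars
```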
# Load necessary libraries
library(ggplot2)
library(car) # for QQ plot
library(lmtest) # for diagnostic tests
library(MASS) # for standardized residuals
# Regression diagnostics for model3
par(mfrow = c(2, 2)) # Set up 2x2 layout
plot(model3) # Default linear model diagnostics
par(mfrow = c(1, 1)) # Reset layout
# Cook's Distance plot
influential_threshold <- 4 / length(model3$fitted.values)
cooksd <- cooks.distance(model3)
plot(cooksd, type = "h", main = "Cook's Distance for Model 3",
ylab = "Cook's Distance", xlab = "Observation Index")
abline(h = influential_threshold, col = "red", lty = 2)
legend("topright", legend = paste("Threshold =", round(influential_threshold, 4)),
col = "red", lty = 2)
# Load the dataset from the web
college <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
# Filter for California homes only
college_CA <- subset(college, State == "CA")
# Fit a multiple linear regression: Price ~ Size + Beds + Baths
model4 <- lm(Price ~ Size + Beds + Baths, data = college_CA)
# Show the regression summary
summary(model4)
# Optional: Scatterplot matrix to visualize relationships
pairs(~ Price + Size + Beds + Baths, data = college_CA,
main = "Scatterplot Matrix: Home Features vs Price in California",
pch = 19, col = "steelblue")
We conducted a multiple linear regression to assess how size, number of bedrooms, and number of bathrooms collectively influence home prices in California. The analysis revealed that size is a statistically significant predictor, indicating that larger homes tend to have higher prices. In contrast, bedrooms and bathrooms did not show significant individual effects once size was accounted for, which suggests potential multicollinearity among these variables. Overall, size emerged as the most reliable factor in predicting home prices within this model. Separately, the "Home Prices by State" boxplot produced for the state comparison shows that California has not only a higher median home price but also a tighter distribution, with some exceptionally high-priced homes.
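One way to probe the suspected multicollinearity is to look at the correlations among the predictors and at their variance inflation factors (VIFs). The sketch below computes VIFs by hand in base R as 1 / (1 - R²), where R² comes from regressing each predictor on the other two. It runs on the six head() rows only, so the values are illustrative rather than the report's; toy, predictors, and vif_manual are hypothetical names.

```r
# Toy data: predictor columns of the six California rows shown by head()
toy <- data.frame(
  Size  = c(1589, 2008, 2380, 1868, 1360, 2131),
  Beds  = c(3, 3, 5, 3, 2, 3),
  Baths = c(2.5, 2.0, 3.0, 3.0, 2.0, 2.0)
)

predictors <- c("Size", "Beds", "Baths")

# Pairwise correlations among the predictors
print(round(cor(toy[predictors]), 2))

# VIF for each predictor: regress it on the remaining predictors
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = toy))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 2))
```

With the car package that this report already loads, vif(model4) should give the same quantity for the full California model. Large VIFs would support the multicollinearity explanation for the non-significant Beds and Baths coefficients.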
# Load necessary libraries
library(ggplot2)
library(car) # For QQ plot
library(lmtest) # For additional diagnostic tests
library(MASS) # For standardized residuals
# Diagnostic Plots (built-in in R)
par(mfrow = c(2, 2)) # Set up 2x2 plot layout
plot(model4) # model4 was created in Q4
par(mfrow = c(1, 1)) # Reset to default layout
# Cook's Distance threshold and plot
influential_threshold <- 4 / length(model4$fitted.values)
cooksd <- cooks.distance(model4)
plot(cooksd, type = "h", main = "Cook's Distance for Model 4",
ylab = "Cook's Distance", xlab = "Observation Index")
abline(h = influential_threshold, col = "red", lty = 2)
legend("topright", legend = paste("Threshold =", round(influential_threshold, 4)),
col = "red", lty = 2)
# Load the dataset
college <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
# Filter for the four states of interest
four_states <- subset(college, State %in% c("CA", "NY", "NJ", "PA"))
# Run one-way ANOVA: Price ~ State
model5 <- aov(Price ~ State, data = four_states)
# Show ANOVA summary
summary(model5)
# Visualize with boxplot
boxplot(Price ~ State, data = four_states,
        main = "Home Prices by State",
        xlab = "State", ylab = "Price ($1,000s)",  # Price is recorded in thousands of dollars
        col = c("steelblue", "orange", "forestgreen", "purple"),
        border = "black")
A one-way ANOVA was conducted to determine if there are significant differences in average home prices among California, New York, New Jersey, and Pennsylvania. The analysis yielded an F-statistic of 7.355 with a p-value of 0.000148. Since the p-value is less than 0.05, we reject the null hypothesis and conclude that at least one state’s average home price differs significantly from the others. The accompanying boxplot visually supports this finding, showing that California has a higher median home price and a tighter price distribution compared to the other states.
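The reported p-value can be cross-checked from the F-statistic and its degrees of freedom. The sketch below assumes all 120 observations from the four states enter the ANOVA with no missing prices, giving 3 and 116 degrees of freedom; the variable names are hypothetical.

```r
# Reported F-statistic from the one-way ANOVA in this report
f_stat <- 7.355

# Assumed design: 4 states, 120 homes in total, no missing prices
k <- 4
n <- 120

df_between <- k - 1   # numerator degrees of freedom
df_within  <- n - k   # denominator degrees of freedom

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

print(c(df_between = df_between, df_within = df_within))
print(p_value)
```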
# Load necessary libraries
library(ggplot2)
library(car) # for QQ plot
library(lmtest) # for diagnostic tests
library(MASS) # for residuals
# Regression diagnostics for model5 (ANOVA is a special case of linear model)
par(mfrow = c(2, 2)) # 2x2 layout
plot(model5) # Default diagnostic plots
par(mfrow = c(1, 1)) # Reset layout
# Cook's Distance
influential_threshold <- 4 / length(model5$fitted.values)
cooksd <- cooks.distance(model5)
plot(cooksd, type = "h", main = "Cook's Distance for ANOVA Model (Q5)",
ylab = "Cook's Distance", xlab = "Observation Index")
abline(h = influential_threshold, col = "red", lty = 2)
legend("topright", legend = paste("Threshold =", round(influential_threshold, 4)),
col = "red", lty = 2)
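The ANOVA shows only that at least one state differs. The post-hoc Tukey HSD test mentioned in the introduction identifies which pairs of states differ. The sketch below shows the mechanics on simulated placeholder prices, not the Zillow data: the group means and spread are invented purely so that the example runs on its own, and sim, fit, and tukey are hypothetical names.

```r
# SIMULATED placeholder data, NOT the HomesForSale data.
# Means and standard deviation are made up for illustration only.
set.seed(353)
sim <- data.frame(
  State = factor(rep(c("CA", "NJ", "NY", "PA"), each = 30)),
  Price = c(rnorm(30, mean = 500, sd = 150),
            rnorm(30, mean = 350, sd = 150),
            rnorm(30, mean = 350, sd = 150),
            rnorm(30, mean = 250, sd = 150))
)

# One-way ANOVA, as in the report's model5
fit <- aov(Price ~ State, data = sim)

# Tukey HSD: all pairwise differences with family-wise 95% intervals
tukey <- TukeyHSD(fit, "State", conf.level = 0.95)

print(tukey)
```

In the report itself, TukeyHSD(model5) would apply the same test to the real fit. Pairs whose adjusted p-value is below 0.05 (equivalently, whose interval excludes zero) are the ones that differ significantly, and plot(tukey) draws the intervals.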
This report presents the results of our analysis on housing data from California, New York, New Jersey, and Pennsylvania. Each research question focused on identifying how different features of a home relate to its price, primarily within California.
Size and Price (CA): We found that for every additional square foot, the average home price in California increases by about $339.20 (roughly $339,200 per 1,000 square feet), since Size is recorded in square feet and Price in thousands of dollars. The relationship is statistically significant, but the R-squared value suggests that size alone doesn’t explain all the variation in price.
Bedrooms and Price (CA): On average, each additional bedroom is associated with an $84,767.30 higher home price. However, the slope was not statistically significant, and the R-squared value was low, meaning bedrooms don’t strongly predict price on their own.
Bathrooms and Price (CA): Each extra bathroom increases the price by about $194,739.10, and this result is statistically significant. Still, the model’s R-squared shows that other factors are likely influencing price as well.
Size, Bedrooms, and Bathrooms Combined (CA): When we used all three variables together, the model estimated the following, with each effect holding the other two variables constant:
A $281.10 increase for every additional square foot,
A $33,703.60 decrease per additional bedroom,
An $83,984.40 increase per additional bathroom.
However, only the size variable was statistically significant, and the results suggest possible multicollinearity among the predictors.
Overall, this project helped us explore some of the key factors that influence housing prices in California. Out of the variables we tested, home size was the most consistent and reliable predictor of price. Other variables like bedrooms and bathrooms had some influence, but the models suggest we’d need more detailed data (such as types of bathrooms or square footage per bedroom) to build stronger predictions. Since California homes tend to be priced much higher than those in other states, it could be valuable to explore what makes the housing market there so different. These findings may help inform future decisions related to urban planning, housing policies, and economic development.
Lock, R., Lock, P., Morgan, K., Lock, E., & Lock, D. (2020). HomesForSale [Dataset]. https://www.lock5stat.com/datapage3e.html
Lock, R., Lock, P., Morgan, K., Lock, E., & Lock, D. (2020). Dataset documentation for the third edition of “Statistics: UnLocking the Power of Data.” Wiley. https://www.lock5stat.com/datasets3e/Lock5DataGuide3e.pdf