This project analyzes homes for sale in four U.S. states and the attributes that affect their asking price, such as size, location, number of bedrooms, and number of bathrooms. The dataset provided contains 5 variables and 120 observations.
To analyze the data and learn from it, we ask the following 5 questions:
1. Use the data only for California. How much does the size of a home influence its price?
2. Use the data only for California. How does the number of bedrooms of a home influence its price?
3. Use the data only for California. How does the number of bathrooms of a home influence its price?
4. Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?
5. Are there significant differences in home prices among the four states (CA, NY, NJ, PA)?
Our dataset comes from the file https://www.lock5stat.com/datasets3e/HomesForSale.csv, which is listed on https://www.lock5stat.com/datapage3e.html. It contains 120 observations (30 homes from each of the four states) and 5 variables: State, Price, Size, Beds, and Baths.
Our goal in this report is to understand how home prices vary and which attributes affect them most. We hope to reach conclusions that are realistic enough to apply to real-world pricing.
##
## Call:
## lm(formula = Price ~ Size, data = ca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462.55 -139.69 39.24 147.65 352.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -56.81675 154.68102 -0.367 0.716145
## Size 0.33919 0.08558 3.963 0.000463 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared: 0.3594, Adjusted R-squared: 0.3365
## F-statistic: 15.71 on 1 and 28 DF, p-value: 0.0004634
## 2.5 % 97.5 %
## (Intercept) -373.6664578 260.0329614
## Size 0.1638888 0.5144945
Based on the fitted model, the estimated slope is 0.339 with a p-value of 0.00046. Price is measured in thousands of dollars, and the fitted values imply that size is recorded in square feet rather than thousands of square feet (with an intercept of -56.8 and a mean CA price of about 535, the mean size works out to roughly 1,750). The asking price therefore increases by about $339 per additional square foot, or about $339,000 per 1,000 square feet. The small p-value indicates that size has a statistically significant association with the price of a home, and the scatterplot with the fitted line shows the same upward trend. The \(R^2\) value of 0.359 means that about 36% of the variation in CA home prices is explained by size alone.
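As a rough check on the figures above, the t statistic, p-value, and confidence interval for the slope can be rebuilt from the reported estimate and standard error. This is a minimal sketch using numbers copied from the printed output; the variable names are illustrative, and the per-square-foot conversion assumes that Size is recorded in square feet.

```r
# Values copied from the Price ~ Size summary output
est <- 0.33919   # slope estimate
se  <- 0.08558   # standard error of the slope
df  <- 28        # residual degrees of freedom

t_stat <- est / se                               # t statistic for the slope
p_val  <- 2 * pt(-abs(t_stat), df)               # two-sided p-value
ci     <- est + c(-1, 1) * qt(0.975, df) * se    # 95% confidence interval

# Assumption: Price is in $1000s and Size is in square feet,
# so the slope times 1000 is dollars per additional square foot.
dollars_per_sqft <- est * 1000

print(round(c(t = t_stat, p = p_val, lower = ci[1], upper = ci[2]), 4))
print(dollars_per_sqft)
```

The rebuilt interval should agree with the `confint()` output up to rounding of the inputs.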
##
## Call:
## lm(formula = Price ~ Beds, data = ca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -413.83 -236.62 29.94 197.69 570.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 269.76 233.62 1.155 0.258
## Beds 84.77 72.91 1.163 0.255
##
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared: 0.04605, Adjusted R-squared: 0.01198
## F-statistic: 1.352 on 1 and 28 DF, p-value: 0.2548
## 2.5 % 97.5 %
## (Intercept) -208.78172 748.3065
## Beds -64.58336 234.1180
The model estimates that the price of a home increases by about $85,000 per additional bedroom. However, the p-value is 0.255, and the 95% confidence interval for the slope (-64.6 to 234.1) includes zero, so the number of bedrooms is not a statistically significant predictor of home price in CA. Additionally, the plot shows that prices vary widely at each bedroom count. Any price increase is more likely attributable to houses with more bedrooms having more total square footage, which the analysis for question 1 found to be statistically significant.
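In a simple regression with one predictor, the F statistic is the square of the slope's t statistic, and \(R^2\) follows from F and the residual degrees of freedom. The sketch below, using values copied from the bedrooms output (variable names are illustrative), shows how a weak t statistic translates into the small \(R^2\) reported.

```r
# Values copied from the Price ~ Beds summary output
t_beds   <- 1.163   # t statistic for the Beds slope
df_resid <- 28      # residual degrees of freedom

f_stat <- t_beds^2                          # F = t^2 with one predictor
r2     <- f_stat / (f_stat + df_resid)      # R^2 = F / (F + residual df)
p_val  <- 2 * pt(-abs(t_beds), df_resid)    # two-sided p-value

print(round(c(F = f_stat, R2 = r2, p = p_val), 4))
```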
##
## Call:
## lm(formula = Price ~ Baths, data = ca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -374.93 -181.56 -2.74 152.31 614.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.71 148.57 0.611 0.54641
## Baths 194.74 62.28 3.127 0.00409 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared: 0.2588, Adjusted R-squared: 0.2324
## F-statistic: 9.779 on 1 and 28 DF, p-value: 0.004092
## 2.5 % 97.5 %
## (Intercept) -213.62183 395.0466
## Baths 67.17425 322.3040
The estimated slope for this model is 194.74, so each additional bathroom is associated with an average price increase of about $195,000. The p-value is 0.004, meaning the number of bathrooms is a statistically significant predictor of home price in CA; it accounts for about 26% of the variation in price. This is notable because the number of bedrooms was not significant, even though most homes have a number of bathrooms roughly proportional to the number of bedrooms.
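To make the slope interpretation concrete, the fitted line can be evaluated by hand at two bathroom counts. This is only an illustration with coefficients copied from the output; the helper name `predict_price` is hypothetical, and the chosen bathroom counts are arbitrary.

```r
# Coefficients copied from the Price ~ Baths summary output
intercept <- 90.71
slope     <- 194.74

# Hypothetical helper: fitted price (in $1000s) for a given bathroom count
predict_price <- function(baths) intercept + slope * baths

pred <- predict_price(c(2, 3))   # homes with 2 and 3 bathrooms
print(pred)
print(diff(pred))                # difference equals the slope
```

The difference between the two fitted prices equals the slope, which is what "about $195,000 per additional bathroom" means.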
##
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = ca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -415.47 -130.32 19.64 154.79 384.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41.5608 210.3809 -0.198 0.8449
## Size 0.2811 0.1189 2.364 0.0259 *
## Beds -33.7036 67.9255 -0.496 0.6239
## Baths 83.9844 76.7530 1.094 0.2839
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared: 0.3912, Adjusted R-squared: 0.3209
## F-statistic: 5.568 on 3 and 26 DF, p-value: 0.004353
## 2.5 % 97.5 %
## (Intercept) -474.00498463 390.8832902
## Size 0.03663397 0.5255712
## Beds -173.32645947 105.9193266
## Baths -73.78359026 241.7524094
## Loading required package: carData
## Size Beds Baths
## 1.886944 1.262775 1.717079
Once size is accounted for, neither the number of bedrooms (p = 0.624) nor the number of bathrooms (p = 0.284) is a statistically significant predictor of price. Size remains significant (p = 0.026), so we conclude that size is the most important of the three predictors in this case. The full model explains about 39% of the variation in CA home prices, only slightly more than the 36% explained by size alone, and its adjusted \(R^2\) (0.321) is lower than that of the size-only model (0.337). The variance inflation factors are all below 2, so multicollinearity among the predictors is modest. The combined model also helps explain the puzzling results from questions 2 and 3: the apparent effect of bathrooms seems largely to reflect the fact that homes with more bathrooms tend to be larger.
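One way to check whether bedrooms and bathrooms jointly add anything beyond size is a partial F-test comparing the size-only model with the full model. The sketch below is not part of the original analysis; it rebuilds that test and the adjusted \(R^2\) values from the rounded \(R^2\) figures in the output, so the results are approximate. The helper name `adj_r2` is illustrative.

```r
n       <- 30       # CA homes
r2_size <- 0.3594   # R^2, Price ~ Size
r2_full <- 0.3912   # R^2, Price ~ Size + Beds + Baths

# Adjusted R^2 for a model with p predictors
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_size <- adj_r2(r2_size, n, 1)
adj_full <- adj_r2(r2_full, n, 3)

# Partial F-test: do the q = 2 extra predictors improve the fit?
q         <- 2
df_full   <- n - 3 - 1
f_partial <- ((r2_full - r2_size) / q) / ((1 - r2_full) / df_full)
p_partial <- pf(f_partial, q, df_full, lower.tail = FALSE)

print(round(c(adj_size = adj_size, adj_full = adj_full,
              F = f_partial, p = p_partial), 4))
```

With the fitted model objects available, `anova(fit_size, fit_multi)` would run the same comparison directly on the unrounded values.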
## Df Sum Sq Mean Sq F value Pr(>F)
## State 3 1198169 399390 7.355 0.000148 ***
## Residuals 116 6299266 54304
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Price ~ State, data = home)
##
## $State
## diff lwr upr p adj
## NJ-CA -206.83333 -363.6729 -49.99379 0.0044754
## NY-CA -170.03333 -326.8729 -13.19379 0.0280402
## PA-CA -269.80000 -426.6395 -112.96045 0.0001011
## NY-NJ 36.80000 -120.0395 193.63955 0.9282064
## PA-NJ -62.96667 -219.8062 93.87288 0.7224830
## PA-NY -99.76667 -256.6062 57.07288 0.3505951
## State Price.mean Price.sd Price.n
## 1 CA 535.3667 269.1774 30.0000
## 2 NJ 328.5333 157.9731 30.0000
## 3 NY 365.3333 317.8217 30.0000
## 4 PA 265.5667 137.0894 30.0000
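We can see that the p-value for State is 0.000148, which is much lower than 0.05, meaning that mean home prices differ significantly among the four states. The Tukey comparisons show where the differences lie: CA prices are significantly higher than those in NJ, NY, and PA (adjusted p-values of 0.004, 0.028, and 0.0001), while none of the differences among NJ, NY, and PA is significant. Calculating the \(R^2\) value as the State sum of squares divided by the total sum of squares, we get 0.16, so about 16% of the variation in home prices is explained by the state in which a home is located. Additionally, the box plot shows the price ranges and median prices of homes in the four states within the dataset.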
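The \(R^2\) value quoted above is not printed by `summary()` for an `aov` fit, so it has to be computed from the ANOVA table. A minimal sketch, using the sums of squares copied from that table (variable names are illustrative):

```r
# Values copied from the ANOVA table for Price ~ State
ss_state <- 1198169   # between-state sum of squares
ss_resid <- 6299266   # residual sum of squares
df_state <- 3
df_resid <- 116

r2     <- ss_state / (ss_state + ss_resid)                # share of variation explained
f_stat <- (ss_state / df_state) / (ss_resid / df_resid)   # F = MS_state / MS_resid
p_val  <- pf(f_stat, df_state, df_resid, lower.tail = FALSE)

print(round(c(R2 = r2, F = f_stat), 4))
print(signif(p_val, 3))
```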
This project analyzed how various physical characteristics of homes influence asking prices, with a specific focus on homes located in California and a broader comparison across four U.S. states (CA, NY, NJ, and PA). Regression and ANOVA methods were used to evaluate the strength and statistical significance of these relationships.
Overall, the analysis indicates that home size is the strongest predictor of asking price among the attributes studied, at least within California. While the number of bathrooms initially appears to influence price, its effect diminishes once size is accounted for, and the number of bedrooms is not significant in either model. Additionally, geographic location has a statistically significant impact on home prices, though it explains only about 16% of the overall variability. These findings emphasize the importance of considering multiple variables together when evaluating real estate prices.
D2L Assignment (instruction set)
Dr. Zhang’s Project Video (setup reference)
https://www.lock5stat.com/datapage3e.html (all data page)
https://www.lock5stat.com/datasets3e/HomesForSale.csv (home data)
ChatGPT (code troubleshooting)
Full Analysis Code
``` r
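# Load the data (URL given in the data description)
home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
# Q1
# Keep only the California homes
ca <- subset(home, State == "CA")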
# Fit simple linear regression
fit_size <- lm(Price ~ Size, data = ca)
summary(fit_size)
##
## Call:
## lm(formula = Price ~ Size, data = ca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462.55 -139.69 39.24 147.65 352.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -56.81675 154.68102 -0.367 0.716145
## Size 0.33919 0.08558 3.963 0.000463 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared: 0.3594, Adjusted R-squared: 0.3365
## F-statistic: 15.71 on 1 and 28 DF, p-value: 0.0004634
confint(fit_size)
## 2.5 % 97.5 %
## (Intercept) -373.6664578 260.0329614
## Size 0.1638888 0.5144945
# Plot with regression line
plot(ca$Size, ca$Price, xlab="Size (sq ft)", ylab="Price ($1000s)",
main="CA: Price vs Size")
abline(fit_size)
# Q2
fit_beds <- lm(Price ~ Beds, data = ca)
summary(fit_beds)
##
## Call:
## lm(formula = Price ~ Beds, data = ca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -413.83 -236.62 29.94 197.69 570.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 269.76 233.62 1.155 0.258
## Beds 84.77 72.91 1.163 0.255
##
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared: 0.04605, Adjusted R-squared: 0.01198
## F-statistic: 1.352 on 1 and 28 DF, p-value: 0.2548
confint(fit_beds)
## 2.5 % 97.5 %
## (Intercept) -208.78172 748.3065
## Beds -64.58336 234.1180
# Plot
plot(ca$Beds, ca$Price, xlab="Beds", ylab="Price ($1000s)",
main="CA: Price vs Beds")
abline(fit_beds)
# Q3
fit_baths <- lm(Price ~ Baths, data = ca)
summary(fit_baths)
##
## Call:
## lm(formula = Price ~ Baths, data = ca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -374.93 -181.56 -2.74 152.31 614.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.71 148.57 0.611 0.54641
## Baths 194.74 62.28 3.127 0.00409 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared: 0.2588, Adjusted R-squared: 0.2324
## F-statistic: 9.779 on 1 and 28 DF, p-value: 0.004092
confint(fit_baths)
## 2.5 % 97.5 %
## (Intercept) -213.62183 395.0466
## Baths 67.17425 322.3040
# Plot
plot(ca$Baths, ca$Price, xlab="Baths", ylab="Price ($1000s)",
main="CA: Price vs Baths")
abline(fit_baths)
# Q4
fit_multi <- lm(Price ~ Size + Beds + Baths, data = ca)
summary(fit_multi)
##
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = ca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -415.47 -130.32 19.64 154.79 384.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41.5608 210.3809 -0.198 0.8449
## Size 0.2811 0.1189 2.364 0.0259 *
## Beds -33.7036 67.9255 -0.496 0.6239
## Baths 83.9844 76.7530 1.094 0.2839
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared: 0.3912, Adjusted R-squared: 0.3209
## F-statistic: 5.568 on 3 and 26 DF, p-value: 0.004353
confint(fit_multi)
## 2.5 % 97.5 %
## (Intercept) -474.00498463 390.8832902
## Size 0.03663397 0.5255712
## Beds -173.32645947 105.9193266
## Baths -73.78359026 241.7524094
# Diagnostics for multiple regression
par(mfrow=c(2,2))
plot(fit_multi) # residuals, QQ-plot, scale-location, leverage
# Check multicollinearity (optional)
# install.packages("car") # if needed
library(car)
vif(fit_multi)
## Size Beds Baths
## 1.886944 1.262775 1.717079
# Q5
fit_state <- aov(Price ~ State, data = home)
summary(fit_state)
## Df Sum Sq Mean Sq F value Pr(>F)
## State 3 1198169 399390 7.355 0.000148 ***
## Residuals 116 6299266 54304
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# If overall ANOVA is significant, run pairwise comparisons (Tukey)
TukeyHSD(fit_state)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Price ~ State, data = home)
##
## $State
## diff lwr upr p adj
## NJ-CA -206.83333 -363.6729 -49.99379 0.0044754
## NY-CA -170.03333 -326.8729 -13.19379 0.0280402
## PA-CA -269.80000 -426.6395 -112.96045 0.0001011
## NY-NJ 36.80000 -120.0395 193.63955 0.9282064
## PA-NJ -62.96667 -219.8062 93.87288 0.7224830
## PA-NY -99.76667 -256.6062 57.07288 0.3505951
# Check group means and boxplot
aggregate(Price ~ State, data = home, FUN = function(x) c(mean=mean(x), sd=sd(x), n=length(x)))
## State Price.mean Price.sd Price.n
## 1 CA 535.3667 269.1774 30.0000
## 2 NJ 328.5333 157.9731 30.0000
## 3 NY 365.3333 317.8217 30.0000
## 4 PA 265.5667 137.0894 30.0000
boxplot(Price ~ State, data = home, main="Price by State", ylab="Price ($1000s)")
```