The purpose of this report is to analyze the HomesForSale dataset, which contains data on home characteristics and prices across four states: California, New York, New Jersey, and Pennsylvania. This dataset includes variables such as home price, size, number of bedrooms, and number of bathrooms. By examining these variables, we aim to understand the factors that influence home prices and assess whether location, specifically the state in which a home is located, has a significant impact on price.
In this analysis, a series of statistical techniques will be applied, including regression analysis and ANOVA, to explore questions related to the effects of home size, bedrooms, bathrooms, and state location on price. These methods will allow us to test for significant relationships and differences, providing insights into the primary drivers of home prices. The ultimate goal is to identify meaningful trends that can help in understanding the real estate market across different regions.
We began our analysis by exploring the HomesForSale dataset, sourced from https://www.lock5stat.com/datasets3e/HomesForSale.csv. For each home in the four states, it records the price, size (in square feet), number of bedrooms, and number of bathrooms, allowing us to investigate which factors may influence home prices and whether prices vary significantly by location.
The analysis was conducted in R software, and five primary questions were formulated to guide our investigation. The first question was, “How much does the size of a home influence its price in California?” We addressed this by fitting a simple linear regression model with Price as the response variable and Size as the predictor. This allowed us to interpret the impact of home size on price in this specific state.
The second question was, “How does the number of bedrooms influence the price of a home in California?” To answer this, another linear regression model was fitted using Beds as the predictor variable. This helped us assess whether the number of bedrooms significantly affects home prices in California.
Our third question, “How does the number of bathrooms influence home price in California?” was similarly analyzed by fitting a regression model with Baths as the predictor. Because Baths is the only predictor in this model, the estimate describes the overall association between bathrooms and price; it does not hold other variables, such as size, constant.
For the fourth question, “How do size, number of bedrooms, and number of bathrooms jointly influence home price in California?” we employed a multiple regression analysis using Size, Beds, and Baths as predictors. This provided insights into the combined impact of these variables on home prices while controlling for each factor.
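In symbols, the four regression questions can be sketched as the models below. The coefficient symbols and the error term are standard notation introduced here for illustration and do not appear in the original report; the coefficients are estimated separately for each model, so a symbol such as the first slope does not carry the same value from one line to the next.

```latex
\begin{align*}
\text{Q1:}\quad \text{Price}_i &= \beta_0 + \beta_1\,\text{Size}_i + \varepsilon_i \\
\text{Q2:}\quad \text{Price}_i &= \beta_0 + \beta_1\,\text{Beds}_i + \varepsilon_i \\
\text{Q3:}\quad \text{Price}_i &= \beta_0 + \beta_1\,\text{Baths}_i + \varepsilon_i \\
\text{Q4:}\quad \text{Price}_i &= \beta_0 + \beta_1\,\text{Size}_i + \beta_2\,\text{Beds}_i + \beta_3\,\text{Baths}_i + \varepsilon_i
\end{align*}
```

In the Q4 model each slope is interpreted with the other two predictors held fixed, which is why its estimates can differ from those of the single-predictor models.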
Finally, to answer the fifth question, “Are there significant differences in home prices among the four states (CA, NY, NJ, PA)?” we conducted a one-way ANOVA. This method allowed us to test for statistically significant differences in mean home prices across states, determining if location has a substantial impact on price.
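The one-way ANOVA for this question can be written out as follows. This is a sketch in standard textbook notation, which is an assumption of this illustration and not taken from the report: the symbol k is the number of states and N is the total number of homes.

```latex
\begin{align*}
H_0 &: \mu_{\text{CA}} = \mu_{\text{NY}} = \mu_{\text{NJ}} = \mu_{\text{PA}} \\
H_a &: \text{at least one state mean differs from the others} \\
F &= \frac{SS_{\text{State}} / (k - 1)}{SS_{\text{Residual}} / (N - k)}
\end{align*}
```

A large F statistic means the variation between state means is large relative to the variation among homes within the same state. Rejecting the null hypothesis indicates that the means are not all equal, but it does not identify which states differ.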
Throughout this analysis, we used regression and ANOVA methods to derive meaningful insights into the factors influencing home prices, with each question addressing a specific aspect of the data.
##
## Call:
## lm(formula = Price ~ Size, data = california_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -462.55 -139.69   39.24  147.65  352.21
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -56.81675  154.68102  -0.367 0.716145
## Size          0.33919    0.08558   3.963 0.000463 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared: 0.3594, Adjusted R-squared: 0.3365
## F-statistic: 15.71 on 1 and 28 DF, p-value: 0.0004634
## State Price Size
## 1 CA 533 1589
## 2 CA 610 2008
## 3 CA 899 2380
## 4 CA 929 1868
## 5 CA 210 1360
## 6 CA 268 2131
We begin our analysis with the first question: “How much does the size of a home influence its price in California?” To address this, we performed a simple linear regression using home size as the predictor variable and home price as the response variable.
The model results indicate a statistically significant positive relationship between home size and price. Specifically, the coefficient for Size is 0.33919 (p-value = 0.0004634), so each additional square foot of home size is associated with an average increase of about 0.339 in Price, measured in the dataset's price units. Equivalently, an extra 100 square feet corresponds to roughly 34 price units. The small p-value indicates that this relationship is highly significant.
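As a rough illustration of how the fitted line can be read, the sketch below plugs a home size into the estimated equation using the coefficients from the output above. The helper name `predict_price_from_size` and the example size of 2,000 square feet are assumptions made for illustration and are not part of the original analysis.

```r
# Coefficients copied from the summary(lm(Price ~ Size)) output in this report
intercept <- -56.81675
slope_size <- 0.33919

# Hypothetical helper (not in the original analysis): fitted price for a given size
predict_price_from_size <- function(size_sqft) {
  intercept + slope_size * size_sqft
}

# Assumed example size of 2000 square feet, inside the range of the listed homes
predicted_price <- predict_price_from_size(2000)
print(predicted_price)

# Difference in fitted price between two homes that are 500 square feet apart
price_gap_500 <- slope_size * 500
print(price_gap_500)
```

The fitted value is an average for homes of that size. With a residual standard error of 219.3, individual homes can sit well above or below it.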
The model’s R-squared value is 0.3594, meaning that approximately 35.94% of the variability in home prices can be explained by the size of the home. While this indicates that size is a meaningful factor in predicting home prices, there are likely other variables contributing to price variations.
Overall, the analysis provides strong evidence that home size is positively associated with home prices in California, but additional factors should also be considered for a more comprehensive understanding. A scatter plot was also created to visually support this finding.
The scatter plot above displays the relationship between home size and price for homes in California. Each blue point represents an individual home, with the x-axis indicating the size of the home and the y-axis showing the price. The red regression line demonstrates the positive trend identified by the linear regression model, suggesting that as the size of a home increases, the price also tends to increase. This visual supports the model’s finding of a significant positive relationship between home size and price, indicating that home size is an important factor in determining home prices in California. However, the spread of points around the regression line also suggests some variability, implying that other factors may influence home prices as well.
##
## Call:
## lm(formula = Price ~ Beds, data = california_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -413.83 -236.62   29.94  197.69  570.94
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   269.76     233.62   1.155    0.258
## Beds           84.77      72.91   1.163    0.255
##
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared: 0.04605, Adjusted R-squared: 0.01198
## F-statistic: 1.352 on 1 and 28 DF, p-value: 0.2548
## State Price Beds
## 1 CA 533 3
## 2 CA 610 3
## 3 CA 899 5
## 4 CA 929 3
## 5 CA 210 2
## 6 CA 268 3
Our second question examines how the number of bedrooms in a home influences its price in California. To address this, we performed a simple linear regression with the number of bedrooms as the predictor and home price as the response variable.
The analysis resulted in a coefficient of 84.77 for the number of bedrooms, with a p-value of 0.255. This suggests that the relationship between the number of bedrooms and home price is not statistically significant. Additionally, the R-squared value of 0.04605 indicates that only about 4.61% of the variation in home prices can be explained by the number of bedrooms alone.
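One way to see why this estimate is not significant is to build an approximate 95% confidence interval for the Beds coefficient from the reported estimate and standard error. The sketch below uses only numbers from the output above; the variable names are illustrative, and the 95% level is an assumed conventional choice.

```r
# Values copied from the summary(lm(Price ~ Beds)) output in this report
estimate_beds <- 84.77
std_error_beds <- 72.91
residual_df <- 28

# Two-sided 95% critical value from the t distribution (assumed confidence level)
t_critical <- qt(0.975, df = residual_df)

# Confidence interval: estimate plus or minus critical value times standard error
ci_beds <- estimate_beds + c(-1, 1) * t_critical * std_error_beds
print(round(ci_beds, 1))

# TRUE when the interval covers zero, which agrees with the non-significant p-value
covers_zero <- ci_beds[1] < 0 && ci_beds[2] > 0
print(covers_zero)
```

An interval that includes zero means the data are compatible with no linear association between bedrooms and price, as well as with a fairly large positive one.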
In summary, the results imply that the number of bedrooms does not have a strong or significant impact on home prices in California. The large residual standard error (267.6) and low R-squared value highlight the need to consider additional factors to better understand the variability in home prices. These results can again be visualised with a scatter plot.
The scatter plot above illustrates the relationship between the number of bedrooms and home price for properties in California. Each blue dot represents a home, with the x-axis showing the number of bedrooms and the y-axis depicting the home price. The red regression line slopes upward, suggesting that home prices tend to increase slightly as the number of bedrooms rises. However, the wide spread of points around the regression line indicates a substantial amount of variability, implying that the number of bedrooms alone is not a strong predictor of home price. This visual aligns with our regression analysis, which found that the relationship is not statistically significant.
##
## Call:
## lm(formula = Price ~ Baths, data = california_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -374.93 -181.56   -2.74  152.31  614.81
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    90.71     148.57   0.611  0.54641
## Baths         194.74      62.28   3.127  0.00409 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared: 0.2588, Adjusted R-squared: 0.2324
## F-statistic: 9.779 on 1 and 28 DF, p-value: 0.004092
## State Price Beds
## 1 CA 533 3
## 2 CA 610 3
## 3 CA 899 5
## 4 CA 929 3
## 5 CA 210 2
## 6 CA 268 3
We continue our analysis with our third question, which investigates how the number of bathrooms in a home influences its price in California. A simple linear regression was conducted with the number of bathrooms as the predictor variable and home price as the response variable.
The results show a significant positive relationship between the number of bathrooms and home price. Specifically, the coefficient for Baths is 194.74, with a p-value of 0.00409, indicating that the association is statistically significant at the 0.01 level. This suggests that, on average, each additional bathroom is associated with an increase of approximately 194.74 units in home price.
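The t value and p-value quoted above can be reproduced from the estimate and its standard error. The sketch below does this with base R and the numbers in the output; the variable names are illustrative, and small differences from the printed values can arise because the inputs are rounded.

```r
# Values copied from the summary(lm(Price ~ Baths)) output in this report
estimate_baths <- 194.74
std_error_baths <- 62.28
residual_df <- 28

# t statistic: estimate divided by its standard error
t_value <- estimate_baths / std_error_baths

# Two-sided p-value from the t distribution with 28 degrees of freedom
p_value <- 2 * pt(abs(t_value), df = residual_df, lower.tail = FALSE)

print(round(t_value, 3))
print(signif(p_value, 3))
```

The same calculation applies to every row of the coefficient tables in this report.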
The R-squared value of 0.2588 indicates that around 25.88% of the variation in home prices can be explained by the number of bathrooms, suggesting a moderate relationship. However, the residual standard error of 235.8 implies that there is still considerable variability in home prices that is not accounted for by the number of bathrooms alone.
Overall, this analysis provides evidence that the number of bathrooms is a significant factor in predicting home prices, although other factors likely contribute to the overall variability in home prices. A box plot of price by number of bathrooms illustrates this result.
The box plot above displays the distribution of home prices in California categorized by the number of bathrooms. Each box represents the interquartile range (IQR) of home prices for a specific number of bathrooms, with the thick horizontal line indicating the median price. The whiskers extend to show the range of prices within 1.5 times the IQR, and any data points outside this range would be considered outliers.
The plot reveals that homes with more bathrooms generally have higher median prices. For example, homes with three bathrooms tend to have higher prices compared to homes with one or two bathrooms. Additionally, the variation in home prices appears to be larger for homes with one and two bathrooms, as shown by the greater spread of the boxes and whiskers, while homes with 2.5 or 4 bathrooms have narrower price ranges. This visualization supports the regression analysis indicating that the number of bathrooms is a significant factor in predicting home prices.
##
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = california_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -415.47 -130.32   19.64  154.79  384.94
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41.5608   210.3809  -0.198   0.8449
## Size          0.2811     0.1189   2.364   0.0259 *
## Beds        -33.7036    67.9255  -0.496   0.6239
## Baths        83.9844    76.7530   1.094   0.2839
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared: 0.3912, Adjusted R-squared: 0.3209
## F-statistic: 5.568 on 3 and 26 DF, p-value: 0.004353
## State Price Baths Beds
## 1 CA 533 3 3
## 2 CA 610 3 3
## 3 CA 899 5 5
## 4 CA 929 3 3
## 5 CA 210 2 2
## 6 CA 268 3 3
The fourth question asks how the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price, using only the data for California. To address this, we conducted a multiple linear regression analysis with home price as the response variable and size, number of bedrooms, and number of bathrooms as the predictor variables.
The regression results indicate that home size is a significant predictor of price, with a coefficient of 0.2811 and a p-value of 0.0259, suggesting that as the size of a home increases, the price tends to increase significantly. On the other hand, the number of bedrooms and bathrooms did not show a significant relationship with home price, with p-values of 0.6239 and 0.2839, respectively.
The model’s R-squared value is 0.3912, indicating that approximately 39.12% of the variation in home prices can be explained by the combined influence of size, number of bedrooms, and number of bathrooms. The residual standard error is 221.8, reflecting the variation in price that is not captured by the model. The overall model is statistically significant, as indicated by the F-statistic (5.568) and the p-value of 0.004353.
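The overall F-statistic and the adjusted R-squared are both functions of R-squared, the sample size, and the number of predictors, so they can be checked against the output. The sketch below uses the rounded values reported above and illustrative variable names; the sample size of 30 is inferred from the 26 residual degrees of freedom plus the four estimated coefficients.

```r
# Values taken from the multiple regression output in this report
r_squared <- 0.3912
n_predictors <- 3
residual_df <- 26
n_obs <- residual_df + n_predictors + 1  # inferred sample size: 30 homes

# Overall F statistic expressed through R-squared
f_statistic <- (r_squared / n_predictors) / ((1 - r_squared) / residual_df)

# Upper-tail p-value of the F statistic
f_p_value <- pf(f_statistic, n_predictors, residual_df, lower.tail = FALSE)

# Adjusted R-squared penalises the fit for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / residual_df

print(round(f_statistic, 3))
print(signif(f_p_value, 3))
print(round(adj_r_squared, 4))
```

The adjusted R-squared here (0.3209) is lower than in the size-only model (0.3365), which suggests that adding bedrooms and bathrooms does not improve the fit once the extra predictors are penalised.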
In summary, while home size is a significant factor in predicting price, the number of bedrooms and bathrooms does not appear to have a significant impact when considered together with size. This suggests that size may be the primary driver of home price among these variables. The number of bathrooms was significant on its own but is not in this model, which is consistent with larger homes tending to have more bathrooms, so that much of its apparent effect is shared with size. A scatter plot matrix helps visualize these relationships.
The scatter plot matrix above shows the pairwise relationships between home price, size, number of bedrooms, and number of bathrooms for homes in California. The diagonal panels display the density plots for each variable, providing insights into their distribution.
The off-diagonal scatter plots illustrate how each pair of variables is related, with accompanying correlation coefficients displayed. The correlation between home price and size is moderately strong at 0.599 (p < 0.001), indicating a positive relationship where larger homes tend to have higher prices. The number of bathrooms also shows a significant positive correlation with home price (0.509, p < 0.01). However, the number of bedrooms has a weaker correlation with home price (0.215), suggesting a less pronounced relationship.
The correlation between size and the number of bathrooms is relatively strong at 0.643 (p < 0.001), implying that larger homes generally have more bathrooms. The relationship between size and the number of bedrooms is moderate, with a correlation of 0.449 (p < 0.05). Overall, the matrix provides a comprehensive view of the relationships among the variables, highlighting that home size and the number of bathrooms are more strongly associated with home price than the number of bedrooms.
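In a simple linear regression, the squared correlation between the predictor and the response equals the model's R-squared, so the correlations with price quoted above can be cross-checked against the three single-predictor models. The sketch below does this with the rounded R-squared values from the earlier outputs; the variable names are illustrative.

```r
# R-squared values copied from the three simple regression outputs in this report
r_squared <- c(Size = 0.3594, Beds = 0.04605, Baths = 0.2588)

# All three fitted slopes are positive, so each correlation takes a positive sign
slope_sign <- c(Size = 1, Beds = 1, Baths = 1)

# Correlation with Price implied by each R-squared
implied_r <- slope_sign * sqrt(r_squared)
print(round(implied_r, 3))

# Correlations with Price as reported for the scatter plot matrix
reported_r <- c(Size = 0.599, Beds = 0.215, Baths = 0.509)
print(round(implied_r - reported_r, 3))
```

The implied and reported values agree up to rounding, which confirms that the scatter plot matrix and the regression summaries describe the same California subset.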
##              Df  Sum Sq Mean Sq F value   Pr(>F)
## State         3 1198169  399390   7.355 0.000148 ***
## Residuals   116 6299266   54304
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## State Price
## 1 CA 533
## 2 CA 610
## 3 CA 899
## 4 CA 929
## 5 CA 210
## 6 CA 268
Our last question asks, “Are there significant differences in home prices among the four states (CA, NY, NJ, PA)?” To address this, we performed an Analysis of Variance (ANOVA) to compare home prices across these states.
The ANOVA results indicate that there are indeed significant differences in home prices among the four states. The F-statistic is 7.355, with a p-value of 0.000148, which is well below the conventional significance level of 0.05. This suggests that the state in which a home is located has a significant impact on its price.
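The entries of the ANOVA table above can be re-derived from its sums of squares and degrees of freedom. The sketch below does so in base R and also computes eta-squared, the share of total price variation attributable to state. Eta-squared is an effect-size measure added here for illustration and was not part of the original analysis; the variable names are likewise illustrative.

```r
# Values copied from the ANOVA table in this report
ss_state <- 1198169
ss_residual <- 6299266
df_state <- 3
df_residual <- 116

# Mean squares: each sum of squares divided by its degrees of freedom
ms_state <- ss_state / df_state
ms_residual <- ss_residual / df_residual

# F statistic and its upper-tail p-value
f_statistic <- ms_state / ms_residual
p_value <- pf(f_statistic, df_state, df_residual, lower.tail = FALSE)

# Eta-squared (added for illustration): proportion of total variation explained by State
eta_squared <- ss_state / (ss_state + ss_residual)

print(round(f_statistic, 3))
print(signif(p_value, 3))
print(round(eta_squared, 3))
```

Under these numbers, state accounts for roughly 16% of the variation in price, so the differences among states are statistically significant while most of the variation remains within states.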
The sum of squares for the “State” factor is 1,198,169, indicating the amount of variation in home prices explained by the differences among the states. The residual sum of squares is 6,299,266, representing the variation in home prices not explained by the state factor. Overall, the significant p-value indicates that mean home prices are not the same in all four states, although the residual sum of squares is several times larger than the State sum of squares, so most of the variation in price lies within states. A box plot can be used to visualise these results.
The box plot above displays the distribution of home prices across the four states: California (CA), New Jersey (NJ), New York (NY), and Pennsylvania (PA).
From the plot, it is evident that California has the highest median home price and a wide range of prices, as indicated by the large interquartile range (IQR) and long whiskers. New York also shows a relatively broad distribution, with several outliers extending above the upper whisker, suggesting a few exceptionally high-priced homes. New Jersey and Pennsylvania have lower median home prices compared to California and New York, and their price ranges are more compact, with New Jersey showing the most limited variability.
Overall, the box plot visually supports the ANOVA results, indicating significant differences in home prices among the states, with California and New York exhibiting higher and more variable home prices compared to New Jersey and Pennsylvania.
In conclusion, our analysis of the HomesForSale dataset has provided valuable insights into the factors influencing home prices across different states. Through a series of regression and ANOVA analyses, we identified key findings:
Impact of Home Size on Price in California: Our results show a significant positive relationship between home size and price, with larger homes generally commanding higher prices. The regression analysis indicated that home size explains about 35.94% of the variability in home prices, underscoring its importance as a predictor.
Effect of Number of Bedrooms on Price in California: We found no significant relationship between the number of bedrooms and home price, suggesting that the number of bedrooms alone is not a strong determinant of price in California. The low R-squared value further highlights the limited explanatory power of this variable.
Influence of Number of Bathrooms on Price in California: The number of bathrooms, on the other hand, exhibited a significant positive relationship with home price. Homes with more bathrooms tend to have higher prices, although the effect is moderate, explaining 25.88% of the variability in home prices.
Joint Influence of Size, Bedrooms, and Bathrooms on Price in California: Our multiple regression analysis revealed that home size remains a significant predictor of price when considered alongside the number of bedrooms and bathrooms. However, neither the number of bedrooms nor bathrooms significantly contributed to price prediction when size was included in the model, indicating that home size may be the primary driver among these variables.
State Differences in Home Prices: The ANOVA showed that mean home prices are not equal across the four states: California, New York, New Jersey, and Pennsylvania. In the box plot, California and New York have higher and more variable home prices compared to New Jersey and Pennsylvania, highlighting the influence of location on home prices.
Overall, our findings emphasize the importance of home size and state location as significant factors influencing home prices. The number of bathrooms is significantly associated with price on its own but not once size is accounted for, and the number of bedrooms shows no significant association in either model. These results provide valuable insights for understanding regional differences in the real estate market and the key attributes that drive home prices. Future research could further explore additional variables, such as neighborhood amenities or proximity to urban centers, to gain a more comprehensive understanding of home price determinants.
1. Setup: Load Packages and Data

    # Load the packages used below: dplyr (filter, %>%), ggplot2 (plots), GGally (ggpairs)
    library(dplyr)
    library(ggplot2)
    library(GGally)

    # Read the HomesForSale dataset
    home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")

2. Question 1: Effect of Home Size on Price in California

    # Filter data for California only
    california_data <- home %>% filter(State == "CA")

    # Fit a linear regression model: Price as a function of Size
    model <- lm(Price ~ Size, data = california_data)
    summary(model)

3. Scatter Plot for Question 1

    # Create a scatter plot of Size vs. Price with a regression line
    ggplot(california_data, aes(x = Size, y = Price)) +
      geom_point(color = "blue", alpha = 0.6) +
      geom_smooth(method = "lm", color = "red", se = FALSE) +
      labs(title = "Scatter Plot of Home Size vs. Price in California",
           x = "Size of Home", y = "Price of Home") +
      theme_minimal()

4. Question 2: Effect of Number of Bedrooms on Price in California

    # Fit a linear regression model: Price as a function of Beds
    model_beds <- lm(Price ~ Beds, data = california_data)
    summary(model_beds)

5. Scatter Plot for Question 2

    # Create a scatter plot of Beds vs. Price with a regression line
    ggplot(california_data, aes(x = Beds, y = Price)) +
      geom_point(color = "blue", alpha = 0.6) +
      geom_smooth(method = "lm", color = "red", se = FALSE) +
      labs(title = "Scatter Plot of Number of Bedrooms vs. Price in California",
           x = "Number of Bedrooms", y = "Price of Home") +
      theme_minimal()

6. Question 3: Effect of Number of Bathrooms on Price in California

    # Fit a linear regression model: Price as a function of Baths
    model_baths <- lm(Price ~ Baths, data = california_data)
    summary(model_baths)

7. Box Plot for Question 3

    # Create a box plot of Price by number of Bathrooms
    ggplot(california_data, aes(x = as.factor(Baths), y = Price)) +
      geom_boxplot(fill = "lightblue", color = "darkblue") +
      labs(title = "Box Plot of Home Price by Number of Bathrooms in California",
           x = "Number of Bathrooms", y = "Price of Home") +
      theme_minimal()

8. Question 4: Joint Influence of Size, Bedrooms, and Bathrooms on Price

    # Fit a multiple linear regression model: Price as a function of Size, Beds, and Baths
    model_multiple <- lm(Price ~ Size + Beds + Baths, data = california_data)
    summary(model_multiple)

9. Scatter Plot Matrix for Question 4

    # Create a scatter plot matrix
    ggpairs(california_data, columns = c("Price", "Size", "Beds", "Baths"))

10. Question 5: Differences in Home Prices Among States

    # Fit an ANOVA model to compare home prices across states
    anova_model <- aov(Price ~ State, data = home)
    summary(anova_model)

11. Box Plot for Question 5

    # Create a box plot of Price grouped by State
    ggplot(home, aes(x = State, y = Price)) +
      geom_boxplot(fill = "lightblue", color = "darkblue") +
      labs(title = "Box Plot of Home Prices by State",
           x = "State", y = "Price of Home") +
      theme_minimal()