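This project explores the value of homes in California, New York, New Jersey, and Pennsylvania. The data come from lock5stat.com, specifically the “HomesForSale” data set, which comprises 120 observations across 5 variables. With this data, five questions are addressed: how home price in California relates to size, to the number of bedrooms, and to the number of bathrooms, each taken individually; how these three variables relate to price when considered together; and whether home prices differ significantly among the four states.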
The following information is present within this data set:
State - Location of the home (i.e. CA, NJ, NY, and PA)
Price - Asking price (in $1000’s)
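Size - Area of all rooms (the documentation lists 1,000’s sq. ft, but the values and fitted slopes in this analysis are on the scale of square feet)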
Beds - Number of bedrooms
Baths - Number of bathrooms
These five variables are recorded for each of the 120 observations in the data set. The data was collected from zillow.com in 2019. Using this information, we can examine how certain factors influence the price of a house, along with any significant differences or relationships between variables.
##
## Call:
## lm(formula = Price ~ Size, data = ca_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462.55 -139.69 39.24 147.65 352.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -56.81675 154.68102 -0.367 0.716145
## Size 0.33919 0.08558 3.963 0.000463 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared: 0.3594, Adjusted R-squared: 0.3365
## F-statistic: 15.71 on 1 and 28 DF, p-value: 0.0004634
There is a statistically significant relationship between the size of a home and its price in California (p = 0.000463). Because Price is recorded in $1,000s and the Size values are on the scale of square feet, the slope of 0.33919 means that each additional square foot is associated with an increase of about $339 in price (roughly $339,190 per 1,000 square feet). The residual standard error of 219.3 indicates that observed prices typically differ from the predicted values by about $219,300. The model explains 35.94% of the variance in price (Multiple R-squared = 0.3594).
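To make the units concrete, the fitted line can be evaluated by hand. The sketch below copies the intercept and slope from the output above. The 2,000 square foot home is a hypothetical example rather than an observation from the data, and the calculation assumes the Size values are on the scale of square feet.

```r
# Coefficients copied from the summary output (Price is in $1,000s)
intercept <- -56.81675
slope     <- 0.33919

# Slope expressed in dollars per additional square foot
dollars_per_sqft <- slope * 1000

# Predicted price (in $1,000s) for a hypothetical 2,000 sq ft home
hypothetical_size <- 2000  # assumption: illustrative value, not from the data
predicted_price   <- intercept + slope * hypothetical_size

cat("Dollars per additional sq ft:", dollars_per_sqft, "\n")
cat("Predicted price (in $1,000s):", predicted_price, "\n")
```

With the data loaded, `predict(model, newdata = data.frame(Size = 2000))` on the fitted Size model should give the same prediction up to rounding of the copied coefficients.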
##
## Call:
## lm(formula = Price ~ Beds, data = ca_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -413.83 -236.62 29.94 197.69 570.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 269.76 233.62 1.155 0.258
## Beds 84.77 72.91 1.163 0.255
##
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared: 0.04605, Adjusted R-squared: 0.01198
## F-statistic: 1.352 on 1 and 28 DF, p-value: 0.2548
For this test, the p-value of 0.2548 is well above 0.05, so there is no significant relationship between the number of bedrooms and the price of a house in California. The slope estimate of 84.77 suggests that price increases by about $84,770 per additional bedroom, but because the relationship is not significant, this estimate cannot be distinguished from zero. The residual standard error of 267.6 shows that observed prices typically differ from the predicted values by about $267,600. This model explains only 4.605% of the variance in home prices.
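The p-value in the output follows directly from the slope estimate and its standard error. As a minimal sketch, the values below are copied from the Beds row of the coefficient table, and the two-sided p-value is recomputed from the t distribution with 28 residual degrees of freedom.

```r
# Values copied from the Beds row of the coefficient table
estimate  <- 84.77
std_error <- 72.91
df_resid  <- 28

# t statistic: estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

cat("t value:", round(t_value, 3), "\n")
cat("p-value:", round(p_value, 4), "\n")
```

Small differences from the printed output are expected, since the inputs here are already rounded.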
##
## Call:
## lm(formula = Price ~ Baths, data = ca_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -374.93 -181.56 -2.74 152.31 614.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.71 148.57 0.611 0.54641
## Baths 194.74 62.28 3.127 0.00409 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared: 0.2588, Adjusted R-squared: 0.2324
## F-statistic: 9.779 on 1 and 28 DF, p-value: 0.004092
The p-value for this test is 0.004092, indicating a statistically significant relationship. The slope of 194.74 indicates that each additional bathroom is associated with an increase of about $194,740 in price. The residual standard error of 235.8 means that observed prices typically differ from the predicted values by about $235,800. This model explains 25.88% of the variance in home prices.
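A confidence interval gives a sense of how precisely the bathroom effect is estimated. The sketch below builds a 95% interval from the estimate and standard error copied from the output above. It is a hand calculation for illustration; `confint(model)` on the fitted Baths model would be the usual route when the data are loaded.

```r
# Values copied from the Baths row of the coefficient table (in $1,000s)
estimate  <- 194.74
std_error <- 62.28
df_resid  <- 28

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Interval: estimate plus or minus critical value times standard error
ci <- estimate + c(-1, 1) * t_crit * std_error

cat("95% CI for the Baths slope (in $1,000s):", round(ci, 2), "\n")
```

The interval excludes zero, which agrees with the p-value below 0.05, but it is wide, so the per-bathroom figure should be read as a rough estimate.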
##
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = ca_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -415.47 -130.32 19.64 154.79 384.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41.5608 210.3809 -0.198 0.8449
## Size 0.2811 0.1189 2.364 0.0259 *
## Beds -33.7036 67.9255 -0.496 0.6239
## Baths 83.9844 76.7530 1.094 0.2839
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared: 0.3912, Adjusted R-squared: 0.3209
## F-statistic: 5.568 on 3 and 26 DF, p-value: 0.004353
For this multiple regression model, there are a few things to unpack. The overall model is significant (F-statistic p-value = 0.004353). Among the individual predictors, only the size of a home has a significant relationship with price once the other variables are accounted for (p = 0.0259). Beds and Baths do not, with p-values of 0.6239 and 0.2839 respectively; note that Baths was significant on its own but is no longer significant when Size is in the model. The residual standard error of 221.8 shows that observed prices typically differ from the predicted values by about $221,800. This model explains 39.12% of the variance in home prices.
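Multiple R-squared always rises when predictors are added, so adjusted R-squared is the fairer comparison between this model and the Size-only model. The sketch below recomputes both adjusted values from the printed R-squared figures, taking n = 30 California homes from the residual degrees of freedom in the outputs (28 + 2 estimated coefficients).

```r
# Sample size implied by the simple-regression output: 28 residual df + 2 coefficients
n <- 30

# Adjusted R-squared: penalizes R-squared for the number of predictors k
adjusted_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

# R-squared values copied from the two summary outputs
adj_full      <- adjusted_r2(0.3912, n, k = 3)  # Size + Beds + Baths
adj_size_only <- adjusted_r2(0.3594, n, k = 1)  # Size alone

cat("Adjusted R-squared, full model:", round(adj_full, 4), "\n")
cat("Adjusted R-squared, Size only: ", round(adj_size_only, 4), "\n")
```

The full model's adjusted value is lower than the Size-only model's, which suggests that adding Beds and Baths does not improve the fit enough to justify the extra terms.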
## Df Sum Sq Mean Sq F value Pr(>F)
## State 3 1198169 399390 7.355 0.000148 ***
## Residuals 116 6299266 54304
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Price ~ State, data = homes)
##
## $State
## diff lwr upr p adj
## NJ-CA -206.83333 -363.6729 -49.99379 0.0044754
## NY-CA -170.03333 -326.8729 -13.19379 0.0280402
## PA-CA -269.80000 -426.6395 -112.96045 0.0001011
## NY-NJ 36.80000 -120.0395 193.63955 0.9282064
## PA-NJ -62.96667 -219.8062 93.87288 0.7224830
## PA-NY -99.76667 -256.6062 57.07288 0.3505951
Looking at the ANOVA model, with a p-value of 0.000148, there is a significant difference in home prices among the states. Tukey’s HSD post-hoc test shows where the differences lie: homes in California are significantly more expensive than those in New Jersey (diff of -206.83333, p = 0.0044754), New York (diff of -170.03333, p = 0.0280402), and Pennsylvania (diff of -269.80, p = 0.0001011). Each diff is the other state's mean price minus California's, in $1,000s, so for example New Jersey homes average about $206,800 less. The comparisons among the other three states show no significant differences, and their confidence intervals all include zero.
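The F statistic, its p-value, and the width of the Tukey intervals can all be recovered from the ANOVA table. The sketch below uses the sums of squares and degrees of freedom copied from the output. The group size of 30 homes per state is an inference from the 120 observations and the identical interval widths, not something stated in the output.

```r
# Values copied from the ANOVA table
ss_state <- 1198169; df_state <- 3
ss_resid <- 6299266; df_resid <- 116

# Mean squares and F statistic
ms_state <- ss_state / df_state
ms_resid <- ss_resid / df_resid
f_value  <- ms_state / ms_resid

# Upper-tail p-value from the F distribution
p_value <- pf(f_value, df_state, df_resid, lower.tail = FALSE)

# Half-width of each 95% Tukey interval for balanced groups
n_per_state <- 30  # assumption: 120 observations split evenly over 4 states
half_width <- qtukey(0.95, nmeans = 4, df = df_resid) / sqrt(2) *
  sqrt(ms_resid * 2 / n_per_state)

cat("F value:", round(f_value, 3), " p-value:", signif(p_value, 3), "\n")
cat("Tukey half-width (in $1,000s):", round(half_width, 2), "\n")
```

Any pair of states whose mean prices differ by less than this half-width has an interval that includes zero, which is why only the three comparisons involving California are significant.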
These results shed light on the factors that influence the price of homes, especially in California. Taken individually, size and the number of bathrooms each have a significant relationship with the price of a house, while the number of bedrooms does not. From the single-variable models, the estimated increases in price are about $339 for each additional square foot (with Size on the scale of square feet), $84,770 for each additional bedroom (not significant), and $194,740 for each additional bathroom. When all three variables are modeled together, only size remains significant.
When analyzing the differences in home prices between states, the ANOVA found a significant difference in prices among the four states. Exploring further with Tukey’s HSD post-hoc test, we can see that homes in California are significantly more expensive than those in each of the other states, whereas the other three states show no significant differences among each other.
From this data set and analysis, we can conclude that size and the number of bathrooms are each related to the price of homes in California, whereas the number of bedrooms is not. When comparing California to the other three states, California homes carry a significantly higher price, while the other three states have no significant differences among each other.
Data set - lock5stat (From the list, scroll down to “HomesForSale”)
Documentation for data set - Information pertaining to the details of the data set (Use the legend tab or scroll down to find “HomesForSale”)
Regression, ANOVA, and Tukey’s test are referenced from Dr. Shiju Zhang’s RPubs document “Stat 353 Notes”, specifically chapters 11-13
knitr::opts_chunk$set(echo = TRUE)
#loading data set
homes = read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
#---------------QUESTION 1 CODE---------------
#creating subset that only includes California data
ca_data = subset(homes, State == "CA")
#creating regression model
model <- lm(Price ~ Size, data = ca_data)
#display model results
summary(model)
#creating scatter plot to visualize data
plot(ca_data$Size, ca_data$Price,
main = "Home Price Compared to Size in California",
xlab = "Size (square feet)",
ylab = "Price (in $1000's)",
pch = 19,
col = "springgreen4")
#adding regression line
abline(model, col = "red", lwd = 2)
#---------------QUESTION 2 CODE---------------
#creating regression model
model <- lm(Price ~ Beds, data = ca_data)
#display model results
summary(model)
#scatter plot to visualize data
plot(ca_data$Beds, ca_data$Price,
main = "Home Price Compared to Number of Beds in California",
xlab = "Number of Beds",
ylab = "Price (in $1000's)",
pch = 19,
col = "turquoise4")
#regression line creation
abline(model, col = "red", lwd = 2)
#---------------QUESTION 3 CODE---------------
#creating regression model
model <- lm(Price ~ Baths, data = ca_data)
#display model results
summary(model)
#scatter plot to visualize data
plot(ca_data$Baths, ca_data$Price,
main = "Home Price Compared to Number of Baths in California",
xlab = "Number of Baths",
ylab = "Price (in $1000's)",
pch = 19,
col = "lightseagreen")
#regression line creation
abline(model, col = "red", lwd = 2)
#---------------QUESTION 4 CODE---------------
#creating regression model
model <- lm(Price ~ Size + Beds + Baths, data = ca_data)
#display model results
summary(model)
#---------------QUESTION 5 CODE---------------
#creating ANOVA model
anova_model = aov(Price ~ State, data = homes)
#display model results
summary(anova_model)
#comparing multiple means using Tukey HSD test
TukeyHSD(anova_model)
boxplot(Price ~ State, data = homes,
col = c("turquoise4", "springgreen4", "lightseagreen", "green3"),
main = "Prices of Homes by State",
xlab = "State",
ylab = "Price (in $1,000's)")