Introduction

This project explores the value of homes in the states of California, New York, New Jersey, and Pennsylvania. The data come from lock5stat.com, specifically the “HomesForSale” data set, which comprises 120 observations across 5 variables. With this data, the five questions posed in the Analysis section will be answered.

Data

The following information is present within this data set:

State - Location of the home (i.e. CA, NJ, NY, and PA)

Price - Asking price (in $1000’s)

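Size - Area of all rooms (in sq. ft; the data set documentation lists this as 1,000’s sq. ft, but the scatter plot label and the fitted slopes in this report are consistent with plain square feet)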

Beds - Number of bedrooms

Baths - Number of bathrooms

These are the five variables recorded for each of the 120 observations in this data set. The data were collected from zillow.com in 2019. Using this information, we can examine how certain factors influence the price of a house, along with any significant differences or relationships between variables.

Analysis

Q1: How much does the size of a home in California influence its price?

## 
## Call:
## lm(formula = Price ~ Size, data = ca_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462.55 -139.69   39.24  147.65  352.21 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -56.81675  154.68102  -0.367 0.716145    
## Size          0.33919    0.08558   3.963 0.000463 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared:  0.3594, Adjusted R-squared:  0.3365 
## F-statistic: 15.71 on 1 and 28 DF,  p-value: 0.0004634

There is a statistically significant relationship between the size of a home and its price in California, with p = 0.000463. Price is recorded in $1000’s and Size is treated as square feet (as labeled in the plotting code), so the slope of 0.33919 means that each additional square foot is associated with an increase of about $339 in asking price. The residual standard error of 219.3 means that observed prices typically differ from the model’s predictions by about $219,300. Size explains 35.94% of the variance in price (multiple R-squared = 0.3594).
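To make the units concrete, the sketch below plugs the reported coefficients into the fitted line for a hypothetical 2,000 sq ft home and recovers R-squared from the F-statistic. The home size is an illustrative assumption, Size is assumed to be in square feet, and the coefficients are copied from the output above rather than refit from the data.

```r
# Coefficients copied from the summary output above (Price is in $1000s)
intercept <- -56.81675
slope     <- 0.33919   # change in Price ($1000s) per one unit of Size

# Hypothetical home size (assumption: Size is measured in square feet)
size_sqft <- 2000

# Predicted asking price, in $1000s
predicted_price <- intercept + slope * size_sqft

# The slope expressed in dollars per square foot
dollars_per_sqft <- slope * 1000

# In simple regression, R^2 = F / (F + residual df)
f_stat    <- 15.71
df_res    <- 28
r_squared <- f_stat / (f_stat + df_res)

print(c(predicted_price  = predicted_price,
        dollars_per_sqft = dollars_per_sqft,
        r_squared        = r_squared))
```

Under these assumptions, the predicted asking price is roughly $622,000, and the recovered R-squared agrees with the 0.3594 reported by `summary()`.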

Q2: How does the number of bedrooms of a home in California influence its price?

## 
## Call:
## lm(formula = Price ~ Beds, data = ca_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -413.83 -236.62   29.94  197.69  570.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   269.76     233.62   1.155    0.258
## Beds           84.77      72.91   1.163    0.255
## 
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared:  0.04605,    Adjusted R-squared:  0.01198 
## F-statistic: 1.352 on 1 and 28 DF,  p-value: 0.2548

For this test, the p-value is 0.2548, so there is no significant relationship between the number of bedrooms and the price of a house in California. The slope estimate of 84.77 suggests that the price increases by $84,770 per additional bedroom, but because the relationship is not significant, this estimate cannot be distinguished from zero. The residual standard error of 267.6 shows that observed prices typically differ from predicted values by about $267,600. The number of bedrooms explains only 4.605% of the variance in home prices.
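One way to see why this slope is not significant is to build a 95% confidence interval around it. The sketch below does this by hand from the estimate and standard error reported above; it is an illustration using the rounded printed values, not a refit of the model (with the fitted model object, `confint()` would give the same interval).

```r
# Values copied from the Beds row of the summary output above
estimate  <- 84.77   # slope, in $1000s per bedroom
std_error <- 72.91
df_res    <- 28      # residual degrees of freedom

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df_res)

# Confidence interval for the slope, in $1000s per bedroom
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)

print(round(ci, 2))
```

The interval runs from a negative value to a positive one, so zero (no effect of bedrooms on price) is a plausible value for the slope.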

Q3: How does the number of bathrooms of a home in California influence its price?

## 
## Call:
## lm(formula = Price ~ Baths, data = ca_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -374.93 -181.56   -2.74  152.31  614.81 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    90.71     148.57   0.611  0.54641   
## Baths         194.74      62.28   3.127  0.00409 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared:  0.2588, Adjusted R-squared:  0.2324 
## F-statistic: 9.779 on 1 and 28 DF,  p-value: 0.004092

The p-value for this test is 0.004092, indicating a statistically significant relationship. The slope is 194.74, indicating that each additional bathroom is associated with an increase of about $194,740 in price. The residual standard error of 235.8 means that observed prices typically differ from predicted values by about $235,800. The number of bathrooms explains 25.88% of the variance in home prices.
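The t value and p-value in the output follow directly from the estimate and its standard error. The sketch below reproduces them from the rounded printed values, so small differences in the last digit are expected.

```r
# Values copied from the Baths row of the summary output above
estimate  <- 194.74  # slope, in $1000s per bathroom
std_error <- 62.28
df_res    <- 28      # residual degrees of freedom

# Test statistic for H0: slope = 0
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df_res, lower.tail = FALSE)

print(c(t_value = t_value, p_value = p_value))
```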

Q4: How do the size, the number of bedrooms, and the number of bathrooms of a home in California jointly influence its price?

## 
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = ca_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -415.47 -130.32   19.64  154.79  384.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -41.5608   210.3809  -0.198   0.8449  
## Size          0.2811     0.1189   2.364   0.0259 *
## Beds        -33.7036    67.9255  -0.496   0.6239  
## Baths        83.9844    76.7530   1.094   0.2839  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared:  0.3912, Adjusted R-squared:  0.3209 
## F-statistic: 5.568 on 3 and 26 DF,  p-value: 0.004353

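For this multiple regression model, there are a few things to unpack. The model as a whole is significant (F-statistic p-value = 0.004353). Looking at the individual p-values, the size of a home has a significant relationship with price after adjusting for the other two variables, as indicated by its p-value of 0.0259. The other variables (Beds and Baths) do not have a significant relationship with price once size is accounted for, with p-values of 0.6239 and 0.2839 respectively. Baths was significant on its own in Q3, so its loss of significance here likely reflects overlap with Size, since larger homes tend to have more bathrooms. The residual standard error of 221.8 shows that observed prices typically differ from predicted values by about $221,800. Together, the three variables explain 39.12% of the variance in home prices.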
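To check whether Beds and Baths add anything beyond Size, we can compare this model with the Size-only model from Q1. The sketch below reconstructs the adjusted R-squared and an approximate partial F-test from the rounded R-squared values printed above; the sample size of 30 is inferred from the degrees of freedom (26 residual + 4 coefficients). With the fitted model objects, `anova(reduced, full)` would give the exact test.

```r
# R-squared values copied from the Q1 and Q4 outputs above
r2_size <- 0.3594   # Price ~ Size
r2_full <- 0.3912   # Price ~ Size + Beds + Baths

n           <- 30   # inferred: 26 residual df + 4 estimated coefficients
p_full      <- 3    # predictors in the full model
df_res_full <- n - p_full - 1

# Adjusted R-squared penalizes the extra predictors
adj_r2 <- 1 - (1 - r2_full) * (n - 1) / df_res_full

# Partial F-test for the q = 2 added predictors (Beds and Baths)
q         <- 2
f_partial <- ((r2_full - r2_size) / q) / ((1 - r2_full) / df_res_full)
p_partial <- pf(f_partial, q, df_res_full, lower.tail = FALSE)

print(c(adj_r2 = adj_r2, f_partial = f_partial, p_partial = p_partial))
```

The partial F statistic is below 1 and its p-value is far above 0.05, which is consistent with the adjusted R-squared dropping from 0.3365 (Size only) to 0.3209 when Beds and Baths are added.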

Q5: Are there significant differences in home prices among the four states (CA, NY, NJ, PA)?

##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## State         3 1198169  399390   7.355 0.000148 ***
## Residuals   116 6299266   54304                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Price ~ State, data = homes)
## 
## $State
##             diff       lwr        upr     p adj
## NJ-CA -206.83333 -363.6729  -49.99379 0.0044754
## NY-CA -170.03333 -326.8729  -13.19379 0.0280402
## PA-CA -269.80000 -426.6395 -112.96045 0.0001011
## NY-NJ   36.80000 -120.0395  193.63955 0.9282064
## PA-NJ  -62.96667 -219.8062   93.87288 0.7224830
## PA-NY  -99.76667 -256.6062   57.07288 0.3505951

Looking at the ANOVA model, with a p-value of 0.000148, there is a statistically significant difference in home prices among the four states. Gathering further information using Tukey’s HSD post-hoc test, we can see that home prices in California are significantly higher than in New Jersey (diff of -206.83333 and p-value 0.0044754), New York (diff of -170.03333 and p-value 0.0280402), and Pennsylvania (diff of -269.80 and p-value 0.0001011). Because Price is in $1000’s, these differences correspond to roughly $207,000, $170,000, and $270,000. Comparing the other three states with each other shows no significant differences between them.
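The F value and the width of the Tukey intervals can both be rebuilt from the ANOVA table. The sketch below does so using the sums of squares printed above and assumes 30 homes per state (120 observations across 4 states, which matches the identical interval widths in the Tukey output).

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_state <- 1198169; df_state <- 3
ss_resid <- 6299266; df_resid <- 116

# Mean squares and the F statistic
ms_state <- ss_state / df_state
ms_resid <- ss_resid / df_resid
f_value  <- ms_state / ms_resid
p_value  <- pf(f_value, df_state, df_resid, lower.tail = FALSE)

# Half-width of each 95% Tukey interval
n_per_state <- 30   # assumption: equal group sizes
k           <- 4    # number of states compared
half_width  <- qtukey(0.95, nmeans = k, df = df_resid) / sqrt(2) *
  sqrt(ms_resid * (1 / n_per_state + 1 / n_per_state))

print(c(f_value = f_value, p_value = p_value, half_width = half_width))
```

A pairwise difference is significant at the 5% family-wise level when its absolute value exceeds this half-width (about 157, in $1000’s), which is true only for the three comparisons involving California.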

Summary

These questions shed light on the factors that influence the price of homes, especially in California. Taken individually, size and the number of bathrooms each have a significant relationship with the price of a house, while the number of bedrooms does not. In the simple regression models, the estimated increase in price is about $339 for each additional square foot, $84,770 for each additional bedroom (not significant), and $194,740 for each additional bathroom. When all three variables are considered jointly, only size remains significant.

When analyzing the differences in home prices between states, the ANOVA found a significant difference in prices among the four states. Exploring deeper using Tukey’s HSD post-hoc test, we can see that prices in California are significantly higher than in the other states, whereas the other three states have no significant differences between each other.

From this data set and analysis, we can conclude that the size and the number of bathrooms are related to the price of homes in California, whereas the number of bedrooms is not. When comparing California to the other three states, CA has significantly higher home prices, while the other three show no significant differences among each other.

References and Project Code

Data set - lock5stat (From the list, scroll down to “HomesForSale”)

Documentation for data set - Information pertaining to the details of the data set (Use the legend tab or scroll down to find “HomesForSale”)

Regression, ANOVA, and Tukey’s test are referenced from Dr. Shiju Zhang’s RPubs document “Stat 353 Notes”, specifically chapters 11-13

knitr::opts_chunk$set(echo = TRUE)
#loading data set
homes = read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")

#---------------QUESTION 1 CODE---------------

#creating subset that only includes California data
ca_data = subset(homes, State == "CA")

#creating regression model
model <- lm(Price ~ Size, data = ca_data)

#display model results
summary(model)

#creating scatter plot to visualize data
plot(ca_data$Size, ca_data$Price,
     main = "Home Price Compared to Size in California",
     xlab = "Size (square feet)",
     ylab = "Price (in $1000's)",
     pch = 19,
     col = "springgreen4")

#adding regression line
abline(model, col = "red", lwd = 2)

#---------------QUESTION 2 CODE---------------

#creating regression model
model <- lm(Price ~ Beds, data = ca_data)

#display model results
summary(model)

#scatter plot to visualize data
plot(ca_data$Beds, ca_data$Price,
     main = "Home Price Compared to Number of Beds in California",
     xlab = "Number of Beds",
     ylab = "Price (in $1000's)",
     pch = 19,
     col = "turquoise4")

#regression line creation
abline(model, col = "red", lwd = 2)

#---------------QUESTION 3 CODE---------------

#creating regression model
model <- lm(Price ~ Baths, data = ca_data)

#display model results
summary(model)

#scatter plot to visualize data
plot(ca_data$Baths, ca_data$Price,
     main = "Home Price Compared to Number of Baths in California",
     xlab = "Number of Baths",
     ylab = "Price (in $1000's)",
     pch = 19,
     col = "lightseagreen")

#regression line creation
abline(model, col = "red", lwd = 2)

#---------------QUESTION 4 CODE---------------

#creating regression model
model <- lm(Price ~ Size + Beds + Baths, data = ca_data)

#display model results
summary(model)

#---------------QUESTION 5 CODE---------------

#creating ANOVA model
anova_model = aov(Price ~ State, data = homes)

#display model results
summary(anova_model)

#comparing multiple means using Tukey HSD test
TukeyHSD(anova_model)

boxplot(Price ~ State, data = homes,
        col = c("turquoise4", "springgreen4", "lightseagreen", "green3"),
        main = "Prices of Homes by State",
        xlab = "State",
        ylab = "Price (in $1,000's)")