home = read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")

Introduction

For this project I will be exploring the 5 questions provided to us on D2L. They are:

  1. Using the data only for California, how much does the size of a home influence its price?

  2. Using the data only for California, how does the number of bedrooms of a home influence its price?

  3. Using the data only for California, how does the number of bathrooms of a home influence its price?

  4. Using the data only for California, how do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?

  5. Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This will help you determine if the state in which a home is located has a significant impact on its price. All data should be used.

Data

All of the data for this project was sourced from https://www.lock5stat.com/datapage3e.html, where it is listed under the name HomesForSale. The analysis reads the CSV file for that dataset directly into R. The variables used in the analysis are State (CA, NJ, NY, or PA), Price, Size, Beds (number of bedrooms), and Baths (number of bathrooms).

Analysis

Question 1: Using the data only for California, how much does the size of a home influence its price?

ca_home <- subset(home, State == "CA")
model1 <- lm(Price ~ Size, data = ca_home)
summary(model1)
## 
## Call:
## lm(formula = Price ~ Size, data = ca_home)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462.55 -139.69   39.24  147.65  352.21 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -56.81675  154.68102  -0.367 0.716145    
## Size          0.33919    0.08558   3.963 0.000463 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared:  0.3594, Adjusted R-squared:  0.3365 
## F-statistic: 15.71 on 1 and 28 DF,  p-value: 0.0004634

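The slope is approximately 0.34. Price is recorded in thousands of dollars, and the fitted coefficients imply that Size is in square feet, so the expected price rises by about $340 for each additional square foot (about $34,000 per additional 100 square feet). The slope is statistically significant: its p-value (0.000463) is far below 0.05, so we reject the null hypothesis that the slope is 0. The R-squared of 0.3594 indicates that size explains about 36% of the variation in California home prices.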
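To put a range around that slope, a 95% confidence interval can be rebuilt from the printed estimate and standard error. The following is a minimal sketch that hard-codes the values from the summary above rather than refitting the model; the variable names are illustrative, and the dollar conversion assumes Price is in thousands of dollars.

```r
# Values copied from the summary(model1) output above
slope <- 0.33919   # Estimate for Size
se    <- 0.08558   # Std. Error for Size
df    <- 28        # residual degrees of freedom

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit   <- qt(0.975, df)
ci_slope <- slope + c(-1, 1) * t_crit * se

# Convert to dollars per unit of Size (assumes Price is in $1,000s)
ci_dollars <- ci_slope * 1000
print(round(ci_dollars))
```

The interval lies entirely above zero, which agrees with the significant p-value. With the fitted model in memory, `confint(model1)` should give the same interval up to rounding.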

Question 2: Using the data only for California, how does the number of bedrooms of a home influence its price?

model2 <- lm(Price ~ Beds, data = ca_home)
summary(model2)
## 
## Call:
## lm(formula = Price ~ Beds, data = ca_home)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -413.83 -236.62   29.94  197.69  570.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   269.76     233.62   1.155    0.258
## Beds           84.77      72.91   1.163    0.255
## 
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared:  0.04605,    Adjusted R-squared:  0.01198 
## F-statistic: 1.352 on 1 and 28 DF,  p-value: 0.2548

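The p-value for the slope is 0.255, which is greater than 0.05, so we do not reject the null hypothesis that the slope is 0. Although the estimated slope is positive in this data set, it is not statistically significant, and the number of bedrooms explains under 5% of the variation in price (R-squared of 0.04605). On its own, the number of bedrooms is not a reliable predictor of the price of a home in California.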
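The reported p-value can be checked by hand from the estimate and its standard error. This sketch hard-codes the numbers printed in the summary above (the variable names are illustrative) and computes the two-sided p-value for the t statistic.

```r
# Values copied from the summary(model2) output above
estimate <- 84.77   # Estimate for Beds
se       <- 72.91   # Std. Error for Beds
df       <- 28      # residual degrees of freedom

# t statistic for the null hypothesis that the slope is 0
t_stat <- estimate / se

# Two-sided p-value: probability of a |t| at least this large under the null
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

print(round(c(t = t_stat, p = p_value), 3))
```

Because the inputs are rounded, the result matches the printed t value and p-value only to about three decimal places.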

Question 3: Using the data only for California, how does the number of bathrooms of a home influence its price?

model3 <- lm(Price ~ Baths, data = ca_home)
summary(model3)
## 
## Call:
## lm(formula = Price ~ Baths, data = ca_home)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -374.93 -181.56   -2.74  152.31  614.81 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    90.71     148.57   0.611  0.54641   
## Baths         194.74      62.28   3.127  0.00409 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared:  0.2588, Adjusted R-squared:  0.2324 
## F-statistic: 9.779 on 1 and 28 DF,  p-value: 0.004092

The p-value (0.00409) is below the 0.05 threshold, so we reject the null hypothesis and conclude that the number of bathrooms has a statistically significant relationship with the price of a home in California. The slope of the fitted line is 194.74; because Price is in thousands of dollars, the expected price increases by about $194,740 for each additional bathroom.
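As a worked example of reading that slope, the fitted line can be evaluated at two bathroom counts. This is a sketch using the rounded coefficients printed above; the bathroom counts 2 and 3 are arbitrary illustrative inputs, and the conversion assumes Price is in thousands of dollars.

```r
# Coefficients copied from the summary(model3) output above
intercept   <- 90.71
slope_baths <- 194.74

# Illustrative inputs: homes with 2 and 3 bathrooms
baths <- c(2, 3)

# Fitted line: predicted price in thousands of dollars
predicted_thousands <- intercept + slope_baths * baths

# Convert to dollars (assumes Price is in $1,000s)
predicted_dollars <- predicted_thousands * 1000
print(predicted_dollars)

# The gap between the two predictions equals the slope in dollars
print(diff(predicted_dollars))
```

With the fitted model in memory, `predict(model3, newdata = data.frame(Baths = c(2, 3)))` should return the same predictions in thousands of dollars, up to rounding of the coefficients.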

Question 4: Using the data only for California, how do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?

model4 <- lm(Price ~ Size + Beds + Baths, data = ca_home)
summary(model4)
## 
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = ca_home)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -415.47 -130.32   19.64  154.79  384.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -41.5608   210.3809  -0.198   0.8449  
## Size          0.2811     0.1189   2.364   0.0259 *
## Beds        -33.7036    67.9255  -0.496   0.6239  
## Baths        83.9844    76.7530   1.094   0.2839  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared:  0.3912, Adjusted R-squared:  0.3209 
## F-statistic: 5.568 on 3 and 26 DF,  p-value: 0.004353

The overall model is statistically significant: the F-test p-value (0.004353) is less than 0.05, indicating that size, number of bedrooms, and number of bathrooms jointly help explain price. However, only size has a significant individual effect, as its p-value (0.0259) is the only one under 0.05. Bedrooms (p = 0.6239) and bathrooms (p = 0.2839) do not make statistically significant contributions to price once size is included in the model. The adjusted R-squared (0.3209) is also slightly lower than that of the size-only model (0.3365), so adding bedrooms and bathrooms does not improve on size alone.
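The joint test can be reconstructed from the R-squared value, which makes clear that it assesses all three predictors at once rather than one at a time. This is a sketch using the rounded figures printed above, with illustrative variable names.

```r
# Values copied from the summary(model4) output above
r_squared <- 0.3912   # Multiple R-squared
k         <- 3        # number of predictors (Size, Beds, Baths)
df_resid  <- 26       # residual degrees of freedom

# Overall F statistic: explained variation per predictor
# divided by unexplained variation per residual degree of freedom
f_stat <- (r_squared / k) / ((1 - r_squared) / df_resid)

# Upper-tail probability under the F distribution with (k, df_resid) df
p_overall <- pf(f_stat, k, df_resid, lower.tail = FALSE)

print(round(c(F = f_stat, p = p_overall), 4))
```

The result agrees with the printed F-statistic and p-value up to rounding of the R-squared input.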

Question 5: Are there significant differences in home prices among the four states (CA, NY, NJ, PA)?

model5 <- aov(Price ~ State, data = home)
summary(model5)
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## State         3 1198169  399390   7.355 0.000148 ***
## Residuals   116 6299266   54304                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

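The p-value (0.000148) is far below the 0.05 threshold, so we reject the null hypothesis that the mean home price is the same in all four states and conclude that state has a significant influence on price. The ANOVA alone does not show which states differ from one another; identifying them would require a follow-up pairwise comparison.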
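The F-test indicates only that at least one state's mean price differs from the rest. One common follow-up, which is not part of the original analysis, is Tukey's HSD for all pairwise comparisons. The sketch below wraps it in a hypothetical helper, `pairwise_state_prices`, and assumes a data frame with `Price` and `State` columns like the one loaded at the top of this report.

```r
# Hypothetical helper: pairwise state comparisons after a one-way ANOVA.
# Expects a data frame with a numeric Price column and a State column.
pairwise_state_prices <- function(df) {
  # TukeyHSD() needs the grouping variable stored as a factor
  df$State <- factor(df$State)
  fit <- aov(Price ~ State, data = df)
  # Each row of the result is one pair of states, with the difference in
  # means, a 95% confidence interval, and an adjusted p-value
  TukeyHSD(fit, "State", conf.level = 0.95)
}

# Usage with the data frame loaded at the top of this report:
# pairwise_state_prices(home)
```

Pairs whose adjusted p-value falls below 0.05 would be the ones with a significant difference in mean price.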

Summary

In this project, I explored how the price of a house in California depends on its size, number of bedrooms, and number of bathrooms. For each variable, I first tested whether it has a statistically significant relationship with price and then examined the slope of the line of best fit. Size and number of bathrooms are both significant on their own, and their slopes estimate the average change in price for a change in size or in the number of bathrooms. The number of bedrooms is not significant, so its slope is not interpreted. I then fit a multiple regression with all three variables to assess their joint influence and to compare them with each other; size is the only significant factor in that model. Finally, I ran an ANOVA to check whether the state has a significant influence on price, and it does.

References

All the data for this lab was sourced from this website: https://www.lock5stat.com/datapage3e.html