Introduction

This report analyzes the HomesForSale data set, which contains 120 observations of 5 variables: State, Price, Size, Beds, and Baths. Price appears to be recorded in thousands of dollars (the first listing in the Appendix has a Price of 533), and Size in square feet. To decide whether a relationship between two variables is statistically significant, a significance level of α = 0.05 was chosen because it is a standard choice in statistics. If p < 0.05, the result is treated as statistically significant and the null hypothesis is rejected; otherwise, the null hypothesis is not rejected. For each question, I first predict whether the null hypothesis or the alternative hypothesis will be supported. Under the null hypothesis, there is no relationship between the variables being compared, and any apparent pattern is due to noise. Under the alternative hypothesis, there is a statistically significant relationship between the variables being compared.
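
The decision rule described above can be sketched in a few lines of R. This is only an illustration: `decide` is a hypothetical helper that is not part of the original analysis, and the p-values are copied by hand from the model summaries in the Appendix.

```r
# Significance level used throughout the report
alpha <- 0.05

# Hypothetical helper (not from the original analysis):
# compares a p-value against alpha and returns the decision
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject null" else "fail to reject null"
}

# p-values copied from the Appendix output for Questions 1 to 3
p_values <- c(Q1 = 0.000463, Q2 = 0.255, Q3 = 0.00409)

# Apply the rule to each question
decisions <- sapply(p_values, decide, alpha = alpha)
print(decisions)
```

Failing to reject the null hypothesis does not prove that there is no relationship; it only means this sample does not provide enough evidence of one.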

Questions:

This report will analyze the following questions:

  1. How much does the size of a home influence its price in California? Hypothesis: The alternative hypothesis is true.

  2. How does the number of bedrooms of a home influence its price in California? Hypothesis: The alternative hypothesis is true.

  3. How does the number of bathrooms of a home influence its price in California? Hypothesis: The null hypothesis is true.

  4. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price in California?

     Hypothesis for size ~ price: The alternative hypothesis is true.

     Hypothesis for bedrooms ~ price: The alternative hypothesis is true.

     Hypothesis for bathrooms ~ price: The null hypothesis is true.

  5. Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? Hypothesis: The alternative hypothesis is true.

Analysis

Question 1: How much does the size of a home influence its price in California?

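The model in the Appendix regresses Size on Price, so the slope describes how size changes with price. According to the model summary, the slope for Price is about 1.06. Since Price is in thousands of dollars, each additional $1,000 in price is associated with an increase of about 1.06 sq ft in size. The p-value for the slope is 0.000463, which is less than 0.05, so there is a statistically significant linear relationship between size and price. In a simple linear regression, the p-value for the slope is the same whichever of the two variables is treated as the response, so this conclusion also holds with price as the response. The R-squared of 0.3594 indicates that the relationship accounts for about 36% of the variation.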

My original hypothesis was true.
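
To make the slope concrete, the fitted line from the Question 1 output can be evaluated by hand. This is a minimal sketch: the coefficients are copied from the Appendix, `predict_size` is a hypothetical helper, and Price is assumed to be in thousands of dollars.

```r
# Coefficients copied from summary(modelQ1) in the Appendix
intercept <- 1178.6075  # predicted Size (sq ft) when Price = 0
slope     <- 1.0596     # change in Size (sq ft) per one-unit change in Price

# Hypothetical helper; price_thousands is assumed to be in $1,000s
predict_size <- function(price_thousands) {
  intercept + slope * price_thousands
}

size_500 <- predict_size(500)  # a home listed at $500,000
size_600 <- predict_size(600)  # a home listed at $600,000

# Difference in predicted size for a $100,000 difference in price
diff_100k <- size_600 - size_500
print(c(size_500 = size_500, size_600 = size_600, diff_100k = diff_100k))
```

Under these assumptions, a $100,000 difference in price corresponds to a difference of about 106 sq ft in predicted size.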

Question 2: How does the number of bedrooms of a home influence its price in California?

The model summary reported a p-value of 0.255. Since this is greater than 0.05, there is no statistically significant linear relationship between the number of bedrooms and the price of a home in this sample. The estimated slope for Price is 0.0005433, meaning that each additional $1,000 in price is associated with an increase of only about 0.0005 bedrooms, which cannot be distinguished from zero.

My original hypothesis was false.

Question 3: How does the number of bathrooms of a home influence its price in California?

The model summary gave a p-value of 0.00409. Since the p-value is less than 0.05, there is a statistically significant linear relationship between the number of bathrooms and the price of a home in California. The estimated slope is 0.001329, meaning that each additional $1,000 in price is associated with about 0.0013 more bathrooms. This adds up when homes differ by hundreds of thousands of dollars: a $500,000 difference in price corresponds to about 0.66 bathrooms.

My original hypothesis was false.
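
The models for Questions 1 to 3 in the Appendix use Price as the predictor, whereas the questions ask about price as the outcome. For a simple linear regression, the slope's p-value does not depend on which variable is the response. The sketch below checks this using only the six rows printed by `head(db)` in the Appendix. It is an illustration of the symmetry, not a rerun of the analysis, and `six` is a name introduced here.

```r
# The six California listings shown by head(db) in the Appendix
six <- data.frame(
  Price = c(533, 610, 899, 929, 210, 268),
  Size  = c(1589, 2008, 2380, 1868, 1360, 2131)
)

# Fit the regression in both directions
fit_size_on_price <- lm(Size ~ Price, data = six)
fit_price_on_size <- lm(Price ~ Size, data = six)

# Extract the p-value for the slope from each fit
p_size_on_price <- summary(fit_size_on_price)$coefficients["Price", "Pr(>|t|)"]
p_price_on_size <- summary(fit_price_on_size)$coefficients["Size", "Pr(>|t|)"]

# The two p-values agree even though the slopes themselves differ
same_p <- isTRUE(all.equal(p_size_on_price, p_price_on_size))
print(same_p)
```

The slopes and their units do differ between the two directions, so the interpretation of each coefficient depends on which variable is the response.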

Question 4: How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price in California?

In the joint model, Price is the response and Size, Beds, and Baths are the predictors, so each slope is the estimated change in price (in thousands of dollars) for a one-unit increase in that predictor while the other two are held constant. For size, the p-value was 0.0259. Since it is less than 0.05, size has a statistically significant relationship with price. The slope is 0.2811, meaning that each additional square foot is associated with a price about $281 higher. For the number of bedrooms, the p-value was 0.6239, so there is no significant relationship between price and number of bedrooms once size and bathrooms are accounted for. Taken at face value, the slope estimate of -33.7 would mean that each additional bedroom is associated with a price about $33,700 lower, but its standard error (67.9) is about twice the size of the estimate, so the estimate cannot be distinguished from zero. For the number of bathrooms, the p-value was 0.2839, so there is also no significant relationship between price and number of bathrooms in the joint model. The estimated slope is 83.9844, meaning that each additional bathroom is associated with a price about $84,000 higher, but the standard error (76.8) is nearly as large as the estimate. Bathrooms were significant on their own in Question 3 but not here, which may indicate that much of their association with price overlaps with that of size.

My original hypothesis for size ~ price was true. My original hypothesis for bedrooms ~ price was false. My original hypothesis for bathrooms ~ price was true.
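
The joint model can be read as a formula for predicted price. The sketch below plugs the Question 4 coefficients from the Appendix into that formula. `predict_price` is a hypothetical helper, the example home (1,800 sq ft, 3 bedrooms, 2 bathrooms) is made up for illustration, and Price is assumed to be in thousands of dollars.

```r
# Coefficients copied from summary(modelQ4) in the Appendix
coefs <- c(Intercept = -41.5608, Size = 0.2811, Beds = -33.7036, Baths = 83.9844)

# Hypothetical helper: predicted Price (assumed to be in $1,000s)
predict_price <- function(size, beds, baths) {
  unname(coefs["Intercept"] +
         coefs["Size"]  * size +
         coefs["Beds"]  * beds +
         coefs["Baths"] * baths)
}

# An illustrative home, and the same home with 100 more sq ft
base   <- predict_price(size = 1800, beds = 3, baths = 2)
bigger <- predict_price(size = 1900, beds = 3, baths = 2)

# Holding beds and baths fixed, 100 extra sq ft changes the
# prediction by 100 times the Size slope
size_effect <- bigger - base
print(c(base = base, bigger = bigger, size_effect = size_effect))
```

Only the Size coefficient is statistically significant, so the contributions of Beds and Baths to these predictions are highly uncertain.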

Question 5: Are there significant differences in home prices among the four states (CA, NY, NJ, PA)?

An ANOVA test on the data set gave a p-value of 0.000148. Since it is less than 0.05, there is a significant difference in mean home price among CA, NY, NJ, and PA. To find out where this difference occurs, I ran a TukeyHSD test. The adjusted p-values were 0.0044754 for NJ-CA, 0.0280402 for NY-CA, 0.0001011 for PA-CA, 0.9282064 for NY-NJ, 0.7224830 for PA-NJ, and 0.3505951 for PA-NY. Only the three comparisons involving California fall below 0.05, and all three differences are negative. The mean price in California is therefore significantly higher than in New Jersey (by about $207,000), New York (by about $170,000), and Pennsylvania (by about $270,000), while the other three states do not differ significantly from one another.

My original hypothesis was true.
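
The ANOVA and Tukey results can be checked from the numbers printed in the Appendix. In this minimal sketch, the mean squares and the Tukey table are copied by hand from that output, and the names `tukey` and `significant` are introduced here for illustration.

```r
# Mean squares and degrees of freedom copied from summary(modelQ5)
ms_state <- 399390
ms_resid <- 54304

# The F statistic is the ratio of the two mean squares
f_stat <- ms_state / ms_resid

# Upper-tail probability of the F distribution with 3 and 116 df
p_anova <- pf(f_stat, df1 = 3, df2 = 116, lower.tail = FALSE)

# Tukey HSD results copied from the Appendix output
tukey <- data.frame(
  pair  = c("NJ-CA", "NY-CA", "PA-CA", "NY-NJ", "PA-NJ", "PA-NY"),
  diff  = c(-206.83333, -170.03333, -269.80000, 36.80000, -62.96667, -99.76667),
  p_adj = c(0.0044754, 0.0280402, 0.0001011, 0.9282064, 0.7224830, 0.3505951),
  stringsAsFactors = FALSE
)

# Keep only the pairs whose adjusted p-value is below 0.05
significant <- tukey[tukey$p_adj < 0.05, ]
print(round(f_stat, 3))
print(significant)
```

Every pair that survives the filter involves CA, which matches the conclusion that California is the state that differs from the others.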

Summary

In conclusion, we learned that:

  1. The size of a house in California has a significant relationship with its price.

  2. The number of bedrooms in a house in California does not have a significant relationship with its price.

  3. The number of bathrooms in a house in California has a significant relationship with its price when considered on its own, but not once size and bedrooms are taken into account.

  4. When all three are considered jointly, size is the only significant predictor of price in California.

  5. California has an average house price that is significantly higher than that of New York, New Jersey, or Pennsylvania.

Appendix

db = read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
head(db)
##   State Price Size Beds Baths
## 1    CA   533 1589    3   2.5
## 2    CA   610 2008    3   2.0
## 3    CA   899 2380    5   3.0
## 4    CA   929 1868    3   3.0
## 5    CA   210 1360    2   2.0
## 6    CA   268 2131    3   2.0
#Question 1 code:

modelQ1 <- lm(Size ~ Price, data = subset(db, State == "CA"))
summary(modelQ1)
## 
## Call:
## lm(formula = Size ~ Price, data = subset(db, State == "CA"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -549.31 -346.31   14.74  258.57  796.39 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1178.6075   159.6556   7.382 4.87e-08 ***
## Price          1.0596     0.2673   3.963 0.000463 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 387.5 on 28 degrees of freedom
## Multiple R-squared:  0.3594, Adjusted R-squared:  0.3365 
## F-statistic: 15.71 on 1 and 28 DF,  p-value: 0.0004634
#Question 2 code:

modelQ2 <- lm(Beds ~ Price, data = subset(db, State == "CA"))
summary(modelQ2)
## 
## Call:
## lm(formula = Beds ~ Price, data = subset(db, State == "CA"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.22223 -0.23459 -0.13639  0.02536  1.95759 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.8424861  0.2790631  10.186  6.4e-11 ***
## Price       0.0005433  0.0004673   1.163    0.255    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6774 on 28 degrees of freedom
## Multiple R-squared:  0.04605,    Adjusted R-squared:  0.01198 
## F-statistic: 1.352 on 1 and 28 DF,  p-value: 0.2548
#Question 3 code:

modelQ3 <- lm(Baths ~ Price, data = subset(db, State == "CA"))
summary(modelQ3)
## 
## Call:
## lm(formula = Baths ~ Price, data = subset(db, State == "CA"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.50083 -0.36559  0.08877  0.22996  1.45930 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.571745   0.253843   6.192 1.09e-06 ***
## Price       0.001329   0.000425   3.127  0.00409 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6161 on 28 degrees of freedom
## Multiple R-squared:  0.2588, Adjusted R-squared:  0.2324 
## F-statistic: 9.779 on 1 and 28 DF,  p-value: 0.004092
#Question 4 code:
modelQ4 <- lm(Price ~ Size + Beds + Baths, data = subset(db, State == "CA"))
summary(modelQ4)
## 
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = subset(db, State == 
##     "CA"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -415.47 -130.32   19.64  154.79  384.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -41.5608   210.3809  -0.198   0.8449  
## Size          0.2811     0.1189   2.364   0.0259 *
## Beds        -33.7036    67.9255  -0.496   0.6239  
## Baths        83.9844    76.7530   1.094   0.2839  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared:  0.3912, Adjusted R-squared:  0.3209 
## F-statistic: 5.568 on 3 and 26 DF,  p-value: 0.004353
#Question 5 code:
modelQ5 <- aov(Price ~ State, data = db)
summary(modelQ5)
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## State         3 1198169  399390   7.355 0.000148 ***
## Residuals   116 6299266   54304                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(modelQ5)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Price ~ State, data = db)
## 
## $State
##             diff       lwr        upr     p adj
## NJ-CA -206.83333 -363.6729  -49.99379 0.0044754
## NY-CA -170.03333 -326.8729  -13.19379 0.0280402
## PA-CA -269.80000 -426.6395 -112.96045 0.0001011
## NY-NJ   36.80000 -120.0395  193.63955 0.9282064
## PA-NJ  -62.96667 -219.8062   93.87288 0.7224830
## PA-NY  -99.76667 -256.6062   57.07288 0.3505951