This report analyzes the HomesForSale data set, which contains 120 observations of 5 variables: State, Price, Size, Beds, and Baths. Judging by the values, Price appears to be recorded in thousands of dollars and Size in square feet. To decide whether a relationship between two variables is statistically significant, I chose a significance level of α = 0.05, which is a standard choice in statistics. If p < 0.05, I reject the null hypothesis and call the relationship significant; otherwise I fail to reject the null hypothesis. For each question, I first predict whether the null hypothesis or the alternative hypothesis will be supported. Under the null hypothesis, there is no relationship between the two variables being compared, and any apparent pattern is due to noise. Under the alternative hypothesis, there is a real relationship between the two variables being compared.
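The decision rule above can be sketched in a few lines of R. This is only an illustration: the helper name `decide` is my own and not part of any package, and the vector `p_values` simply reuses three slope p-values that are reported later in this report.

```r
# Significance level chosen for this report
alpha <- 0.05

# Hypothetical helper: turn a p-value into a decision about the null hypothesis
decide <- function(p, alpha = 0.05) {
  ifelse(p < alpha, "reject H0", "fail to reject H0")
}

# Slope p-values reported later in this report (questions 1 to 3)
p_values <- c(size = 0.000463, beds = 0.255, baths = 0.00409)

decisions <- decide(p_values, alpha)
print(decisions)
```

Note that "fail to reject H0" is weaker than "H0 is true": a large p-value only means the data do not provide enough evidence of a relationship.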
This report will analyze the following questions:
How much does the size of a home influence its price in California? Hypothesis: The alternative hypothesis is true.
How does the number of bedrooms of a home influence its price in California? Hypothesis: The alternative hypothesis is true.
How does the number of bathrooms of a home influence its price in California? Hypothesis: The null hypothesis is true.
How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price in California?
Hypothesis for size ~ price: The alternative hypothesis is true.
Hypothesis for bedrooms ~ price: The alternative hypothesis is true.
Hypothesis for bathrooms ~ price: The null hypothesis is true.
How do home prices differ among California, New York, New Jersey, and Pennsylvania? Hypothesis: The alternative hypothesis is true.
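For question 1, the linear model regresses Size on Price, so the estimated slope of ~1.06 belongs to Price. Because Price appears to be recorded in thousands of dollars, this means that each additional $1,000 in price is associated with about 1.06 more sq ft of size. The p-value for the slope is 0.000463, which is less than 0.05, so there is a significant linear relationship between size and price. In a simple linear regression this p-value is the same whichever of the two variables is treated as the response, so the conclusion also holds for size as a predictor of price.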
My original hypothesis was true.
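To illustrate why the direction of a simple regression does not change the slope p-value, here is a small sketch. It uses only the six California rows displayed by `head(db)`, not the full 30-home subset, so its numbers will not match the report's output; the object names are my own.

```r
# Only the six California rows shown by head(db); NOT the full California subset
ca6 <- data.frame(
  Price = c(533, 610, 899, 929, 210, 268),
  Size  = c(1589, 2008, 2380, 1868, 1360, 2131)
)

fit_size_on_price <- lm(Size ~ Price, data = ca6)  # direction used in this report
fit_price_on_size <- lm(Price ~ Size, data = ca6)  # direction the question asks about

p_size_on_price <- summary(fit_size_on_price)$coefficients["Price", "Pr(>|t|)"]
p_price_on_size <- summary(fit_price_on_size)$coefficients["Size", "Pr(>|t|)"]

# The slope p-values agree, although the slopes themselves differ
same_p <- isTRUE(all.equal(p_size_on_price, p_price_on_size))
print(same_p)

# The product of the two slopes equals R-squared
slope_product <- unname(coef(fit_size_on_price)["Price"] * coef(fit_price_on_size)["Size"])
r_squared <- summary(fit_size_on_price)$r.squared
print(isTRUE(all.equal(slope_product, r_squared)))
```

The slope itself does depend on the direction, so a statement such as "each extra square foot adds so many dollars" would need the `Price ~ Size` fit rather than the `Size ~ Price` fit.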
For question 2, the summary of the model reports a p-value of 0.255 for the slope. Since this is greater than 0.05, there is no statistically significant linear relationship between the number of bedrooms and the price of a home in California. The estimated slope for Price is 0.0005433, meaning that each additional $1,000 in price is associated with only about 0.0005 more bedrooms, a change that cannot be distinguished from zero with these data.
My original hypothesis was false.
For question 3, the summary of the model gives a p-value of 0.00409 for the slope. Since the p-value is less than 0.05, there is a significant linear relationship between the number of bathrooms and the price of a home in California. The estimated slope for Price is 0.001329, meaning that each additional $1,000 in price is associated with about 0.001329 more bathrooms. This adds up when homes differ in price by hundreds of thousands of dollars: a $100,000 difference corresponds to about 0.13 bathrooms.
My original hypothesis was false.
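The slopes for questions 2 and 3 are easier to read on a larger price scale. The sketch below rescales them to a $100,000 difference in price. It assumes that one unit of Price is $1,000, which is my reading of the data rather than something stated in the data set itself; the variable names are my own.

```r
# Slopes reported by summary(modelQ2) and summary(modelQ3), per one unit of Price
slope_per_unit <- c(beds = 0.0005433, baths = 0.001329)

# Assumption: one unit of Price is $1,000, so $100,000 is 100 units
units_per_100k <- 100

# Expected change in bedrooms and bathrooms for a $100,000 difference in price
change_per_100k <- slope_per_unit * units_per_100k
print(round(change_per_100k, 3))
```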
For question 4, the model regresses Price on Size, Beds, and Baths together, so each slope describes the change in price (in thousands of dollars) per unit of that predictor while the other two are held constant. For size, the p-value was 0.0259. Since it is less than 0.05, size has a significant relationship with price even after accounting for bedrooms and bathrooms. The slope is 0.28, meaning that each additional square foot is associated with about $280 more in price.
For the number of bedrooms, the p-value was 0.6239, so there is no significant relationship between price and the number of bedrooms once size and bathrooms are accounted for. The slope estimate of -33.7 means that each additional bedroom is associated with a price about $33,700 lower, but its standard error (67.9) is about twice as large as the estimate, so the negative sign cannot be distinguished from noise in the data.
For the number of bathrooms, the p-value was 0.2839, so there is also no significant relationship between price and the number of bathrooms once size and bedrooms are accounted for. The estimated slope is 83.9844, meaning that each additional bathroom is associated with about $84,000 more in price, but the standard error (76.8) is nearly as large as the estimate, so this effect is not distinguishable from zero either.
My original hypothesis for size ~ price was true. My original hypothesis for bedrooms ~ price was false. My original hypothesis for bathrooms ~ price was true.
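As a sketch of how the question 4 coefficients combine, the code below plugs the estimates reported by `summary(modelQ4)` into the fitted equation. The example home (2,000 sq ft, 3 bedrooms, 2 bathrooms) and the helper name `predict_price` are hypothetical, and the result is in the same units as Price, which I assume to be thousands of dollars.

```r
# Coefficient estimates copied from summary(modelQ4)
coefs <- c(intercept = -41.5608, size = 0.2811, beds = -33.7036, baths = 83.9844)

# Hypothetical helper: fitted price for a home with the given characteristics
predict_price <- function(size, beds, baths) {
  unname(coefs["intercept"] + coefs["size"] * size +
           coefs["beds"] * beds + coefs["baths"] * baths)
}

# Hypothetical home: 2,000 sq ft, 3 bedrooms, 2 bathrooms
base <- predict_price(2000, 3, 2)
print(base)

# Same home with 100 more sq ft; beds and baths are held constant
bigger <- predict_price(2100, 3, 2)
print(bigger - base)
```

The difference between the two fitted prices is 100 times the Size slope, which is what "holding the other predictors constant" means in practice.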
For question 5, a one-way ANOVA on the full data set gave a p-value of 0.000148. Since it is less than 0.05, mean home prices differ significantly among CA, NY, NJ, and PA. To find out which states differ, I ran a TukeyHSD test. The adjusted p-values were 0.0044754 for NJ-CA, 0.0280402 for NY-CA, 0.0001011 for PA-CA, 0.9282064 for NY-NJ, 0.7224830 for PA-NJ, and 0.3505951 for PA-NY. Only the three comparisons involving California are below 0.05, and all three differences are negative, so California's mean home price is significantly higher than that of New Jersey, New York, and Pennsylvania (by about $207,000, $170,000, and $270,000, respectively), while the other three states do not differ significantly from one another.
My original hypothesis was true.
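The reading of the TukeyHSD table can be checked with a short sketch. The data frame below is typed in by hand from the TukeyHSD output in this report; the column names `p_adj`, `lwr`, and `upr` are my own labels for the printed columns.

```r
# Values copied from the TukeyHSD(modelQ5) output in this report
tukey <- data.frame(
  pair  = c("NJ-CA", "NY-CA", "PA-CA", "NY-NJ", "PA-NJ", "PA-NY"),
  diff  = c(-206.83333, -170.03333, -269.80000, 36.80000, -62.96667, -99.76667),
  lwr   = c(-363.6729, -326.8729, -426.6395, -120.0395, -219.8062, -256.6062),
  upr   = c(-49.99379, -13.19379, -112.96045, 193.63955, 93.87288, 57.07288),
  p_adj = c(0.0044754, 0.0280402, 0.0001011, 0.9282064, 0.7224830, 0.3505951),
  stringsAsFactors = FALSE
)

alpha <- 0.05

# Pairs whose adjusted p-value is below the significance level
significant <- tukey$pair[tukey$p_adj < alpha]
print(significant)

# A 95% interval that excludes zero tells the same story as p adj < 0.05
excludes_zero <- tukey$pair[tukey$lwr > 0 | tukey$upr < 0]
print(identical(significant, excludes_zero))
```

Both checks pick out the same three pairs, all of which involve California.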
In conclusion, we learned that:
1. The size of a house in California is significantly related to its price.
2. The number of bedrooms in a house in California is not significantly related to its price.
3. The number of bathrooms in a house in California is significantly related to its price on its own, but not once size and bedrooms are accounted for.
4. When all three are considered jointly, size is the only significant predictor of price in California; the number of bedrooms and the number of bathrooms are not.
5. California has an average house price that is significantly higher than that of New York, New Jersey, or Pennsylvania.
db = read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
head(db)
##   State Price Size Beds Baths
## 1    CA   533 1589    3   2.5
## 2    CA   610 2008    3   2.0
## 3    CA   899 2380    5   3.0
## 4    CA   929 1868    3   3.0
## 5    CA   210 1360    2   2.0
## 6    CA   268 2131    3   2.0
#Question 1 code:
modelQ1 <- lm(Size ~ Price, data = subset(db, State == "CA"))
summary(modelQ1)
##
## Call:
## lm(formula = Size ~ Price, data = subset(db, State == "CA"))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -549.31 -346.31   14.74  258.57  796.39
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1178.6075   159.6556   7.382 4.87e-08 ***
## Price          1.0596     0.2673   3.963 0.000463 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 387.5 on 28 degrees of freedom
## Multiple R-squared: 0.3594, Adjusted R-squared: 0.3365
## F-statistic: 15.71 on 1 and 28 DF, p-value: 0.0004634
#Question 2 code:
modelQ2 <- lm(Beds ~ Price, data = subset(db, State == "CA"))
summary(modelQ2)
##
## Call:
## lm(formula = Beds ~ Price, data = subset(db, State == "CA"))
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.22223 -0.23459 -0.13639  0.02536  1.95759
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.8424861  0.2790631  10.186  6.4e-11 ***
## Price       0.0005433  0.0004673   1.163    0.255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6774 on 28 degrees of freedom
## Multiple R-squared: 0.04605, Adjusted R-squared: 0.01198
## F-statistic: 1.352 on 1 and 28 DF, p-value: 0.2548
#Question 3 code:
modelQ3 <- lm(Baths ~ Price, data = subset(db, State == "CA"))
summary(modelQ3)
##
## Call:
## lm(formula = Baths ~ Price, data = subset(db, State == "CA"))
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.50083 -0.36559  0.08877  0.22996  1.45930
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.571745   0.253843   6.192 1.09e-06 ***
## Price       0.001329   0.000425   3.127  0.00409 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6161 on 28 degrees of freedom
## Multiple R-squared: 0.2588, Adjusted R-squared: 0.2324
## F-statistic: 9.779 on 1 and 28 DF, p-value: 0.004092
#Question 4 code:
modelQ4 <- lm(Price ~ Size + Beds + Baths, data = subset(db, State == "CA"))
summary(modelQ4)
##
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = subset(db, State ==
## "CA"))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -415.47 -130.32   19.64  154.79  384.94
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41.5608   210.3809  -0.198   0.8449
## Size          0.2811     0.1189   2.364   0.0259 *
## Beds        -33.7036    67.9255  -0.496   0.6239
## Baths        83.9844    76.7530   1.094   0.2839
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared: 0.3912, Adjusted R-squared: 0.3209
## F-statistic: 5.568 on 3 and 26 DF, p-value: 0.004353
#Question 5 code:
modelQ5 <- aov(Price ~ State, data = db)
summary(modelQ5)
##              Df  Sum Sq Mean Sq F value   Pr(>F)
## State         3 1198169  399390   7.355 0.000148 ***
## Residuals   116 6299266   54304
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(modelQ5)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Price ~ State, data = db)
##
## $State
##             diff       lwr        upr     p adj
## NJ-CA -206.83333 -363.6729  -49.99379 0.0044754
## NY-CA -170.03333 -326.8729  -13.19379 0.0280402
## PA-CA -269.80000 -426.6395 -112.96045 0.0001011
## NY-NJ   36.80000 -120.0395  193.63955 0.9282064
## PA-NJ  -62.96667 -219.8062   93.87288 0.7224830
## PA-NY  -99.76667 -256.6062   57.07288 0.3505951