Introduction

This project answers a series of questions about a housing data set from https://www.lock5stat.com/datasets3e/HomesForSale.csv. Most of the questions concern California homes and ask how certain variables affect home prices. By answering these questions with linear regression and one-way ANOVA, we should gain insight into how various factors affect a house's price.

Data

Home = read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
head(Home)
##   State Price Size Beds Baths
## 1    CA   533 1589    3   2.5
## 2    CA   610 2008    3   2.0
## 3    CA   899 2380    5   3.0
## 4    CA   929 1868    3   3.0
## 5    CA   210 1360    2   2.0
## 6    CA   268 2131    3   2.0

Questions

Q1. Use the data only for California. How much does the size of a home influence its price?

Q2. Use the data only for California. How does the number of bedrooms of a home influence its price?

Q3. Use the data only for California. How does the number of bathrooms of a home influence its price?

Q4. Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?

Q5. Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This will help you determine if the state in which a home is located has a significant impact on its price. All data should be used.

Analysis

(1) Use the data only for California. How much does the size of a home influence its price?

sp = data.frame(Home$Size, Home$Price)

# Perform simple linear regression
lm1_model <- lm(Home$Size ~ Home$Price, data = sp)

# Print the summary of the regression
summary(lm1_model)
## 
## Call:
## lm(formula = Home$Size ~ Home$Price, data = sp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1167.9  -360.5  -117.8   228.1  2310.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1444.2737   109.1487  13.232  < 2e-16 ***
## Home$Price     1.1195     0.2428   4.611 1.02e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 664.8 on 118 degrees of freedom
## Multiple R-squared:  0.1527, Adjusted R-squared:  0.1455 
## F-statistic: 21.26 on 1 and 118 DF,  p-value: 1.022e-05
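  • The slope estimate (1.1195) comes from the model as fitted, lm(Home$Size ~ Home$Price), in which Size is the response and Price is the predictor. It says that predicted size increases by about 1.12 size units for each one-unit increase in price, not that price increases by 1.1195 per unit of size. The 118 residual degrees of freedom also show that all 120 homes were used rather than the California homes only.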

The “Residual standard error” (664.8) is the typical distance between observed and predicted values, measured in units of the response (here, Size).

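The R-squared value (0.1527; the adjusted value is 0.1455) indicates that about 15.3% of the total variation in home size is explained by price. In a simple linear regression R-squared is the squared correlation, so the same share of the variation in price is explained by size.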

The p-value (1.02e-05) associated with the slope coefficient indicates that the relationship between size and price is statistically significant at any commonly used significance level (for example, 0.01).
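The question asks for the effect of size on price in California only, whereas the fit above regresses Size on Price over all states. A minimal sketch of the fit the question describes follows. The helper name fit_price_on is hypothetical and not part of the original report, and it assumes the column names State, Price, and Size shown in the data preview. No output is shown because this model was not run in the report.

```r
# Hypothetical helper (not part of the original report): fit Price on one
# predictor using only the rows for a single state.
fit_price_on <- function(homes, predictor, state = "CA") {
  one_state <- homes[homes$State == state, ]   # keep only the requested state
  # reformulate("Size", response = "Price") builds the formula Price ~ Size
  lm(reformulate(predictor, response = "Price"), data = one_state)
}

# Usage with the data frame loaded earlier in the report:
# ca_size_fit <- fit_price_on(Home, "Size")
# summary(ca_size_fit)
```

Passing the variables by name together with a data argument, instead of writing Home$Size inside the formula, lets the subsetting take effect and keeps the coefficient labels short.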

(2) Use the data only for California. How does the number of bedrooms of a home influence its price?

bp = data.frame(Home$Beds, Home$Price)

# Perform simple linear regression
lm2_model <- lm(Home$Beds ~ Home$Price, data = bp)

# Print the summary of the regression
summary(lm2_model)
## 
## Call:
## lm(formula = Home$Beds ~ Home$Price, data = bp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1555 -0.3620 -0.2432  0.6022  2.7536 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.1241726  0.1270481  24.590   <2e-16 ***
## Home$Price  0.0006266  0.0002826   2.217   0.0285 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7738 on 118 degrees of freedom
## Multiple R-squared:   0.04,  Adjusted R-squared:  0.03186 
## F-statistic: 4.917 on 1 and 118 DF,  p-value: 0.02851
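  • The p-value (0.0285) associated with the slope coefficient indicates that the relationship between the number of bedrooms and price is statistically significant at the 0.05 level but not at the 0.01 level. The relationship is weak: R-squared is 0.04, so only about 4% of the variation is explained. As fitted, the model has Beds as the response and Price as the predictor, and it uses all 120 homes rather than the California homes only.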

(3) Use the data only for California. How does the number of bathrooms of a home influence its price?

bap = data.frame(Home$Baths, Home$Price)

# Perform simple linear regression
lm3_model <- lm(Home$Baths ~ Home$Price, data = bap)

# Print the summary of the regression
summary(lm3_model)
## 
## Call:
## lm(formula = Home$Baths ~ Home$Price, data = bap)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8583 -0.3796 -0.1406  0.6073  2.3516 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.8735308  0.1290746  14.515  < 2e-16 ***
## Home$Price  0.0014088  0.0002871   4.907 2.99e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7861 on 118 degrees of freedom
## Multiple R-squared:  0.1695, Adjusted R-squared:  0.1624 
## F-statistic: 24.08 on 1 and 118 DF,  p-value: 2.992e-06
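  • The p-value (2.99e-06) associated with the slope coefficient indicates that the relationship between the number of bathrooms and price is statistically significant at any commonly used significance level (say at level 0.01). R-squared is 0.1695, so about 17% of the variation is explained. As fitted, the model has Baths as the response and Price as the predictor, so the slope (0.0014088) is the change in predicted bathrooms per one-unit increase in price, and the fit uses all 120 homes rather than the California homes only.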
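Questions 2 and 3 ask how bedrooms and bathrooms influence price in California, so Price should be the response and the data should be restricted to CA. The sketch below fits one simple regression per predictor and collects the slope and its p-value. The function name simple_slopes is hypothetical, and the column names are assumed to match the data preview. No results are shown because these models were not run in the report.

```r
# Hypothetical helper (not part of the original report): for each predictor,
# fit Price ~ predictor on one state's homes and return slope and p-value.
simple_slopes <- function(homes, predictors, state = "CA") {
  one_state <- homes[homes$State == state, ]
  rows <- lapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = "Price"), data = one_state)
    est <- summary(fit)$coefficients[p, ]      # row of the coefficient table for p
    data.frame(predictor = p,
               slope = est[["Estimate"]],
               p_value = est[["Pr(>|t|)"]])
  })
  do.call(rbind, rows)                          # one row per predictor
}

# Usage with the data frame loaded earlier in the report:
# simple_slopes(Home, c("Beds", "Baths"))
```

Each slope is then read directly as the change in predicted price per additional bedroom or bathroom.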

(4) Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?

#create data frame
data <- data.frame(Home$Size, Home$Beds, Home$Baths, Home$Price)

# Perform multiple linear regression
model <- lm(Home$Price ~ Home$Size + Home$Beds + Home$Baths, data = data)

# Print the summary of the regression model
summary(model)
## 
## Call:
## lm(formula = Home$Price ~ Home$Size + Home$Beds + Home$Baths, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -352.31 -157.69  -68.89   86.14  745.66 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 103.75177   92.91802   1.117   0.2665  
## Home$Size     0.08199    0.04264   1.923   0.0570 .
## Home$Beds   -25.80554   32.82340  -0.786   0.4334  
## Home$Baths   84.95750   34.48394   2.464   0.0152 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 228.1 on 116 degrees of freedom
## Multiple R-squared:  0.1953, Adjusted R-squared:  0.1745 
## F-statistic: 9.385 on 3 and 116 DF,  p-value: 1.329e-05

How can we interpret each of the coefficients and each p-value?

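The coefficient 0.08199 means that for each one-unit increase in Size, the predicted price increases by about 0.082 price units, holding Beds and Baths constant. The coefficient -25.80554 means that for each additional bedroom, the predicted price decreases by about 25.81 price units, holding Size and Baths constant. The coefficient 84.95750 means that for each additional bathroom, the predicted price increases by about 84.96 price units, holding Size and Beds constant. The prices in the data preview (533, 610, 899, ...) suggest that Price is recorded in thousands of dollars, in which case these changes are about $82, $25,810, and $84,960 respectively.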

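The p-value for Baths (0.0152) indicates a significant effect on home price at the 0.05 level. The p-value for Size (0.0570) is significant only at the 0.10 level, so the evidence for a size effect is marginal once Beds and Baths are in the model. The larger p-value for Beds (0.4334) indicates that the number of bedrooms has no significant effect on home price after accounting for Size and Baths. The overall F-test (p-value 1.329e-05) shows that the three variables are jointly significant, although the model explains only about 19.5% of the variation in price (R-squared 0.1953). The 116 residual degrees of freedom show that this fit uses all 120 homes rather than the California homes only.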
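Because the question is restricted to California, the joint model can be refitted on the CA rows only. The following is a sketch under the assumption that the columns are named State, Price, Size, Beds, and Baths as in the data preview. The helper name fit_joint is hypothetical. No output is shown because this model was not run in the report.

```r
# Hypothetical helper (not part of the original report): multiple regression
# of Price on Size, Beds, and Baths for one state's homes.
fit_joint <- function(homes, state = "CA") {
  one_state <- homes[homes$State == state, ]
  lm(Price ~ Size + Beds + Baths, data = one_state)
}

# Usage with the data frame loaded earlier in the report:
# ca_fit <- fit_joint(Home)
# summary(ca_fit)   # coefficient table, R-squared, overall F-test
# confint(ca_fit)   # 95% confidence intervals for each coefficient
```

The confidence intervals from confint complement the p-values: an interval that contains zero corresponds to a coefficient that is not significant at the 0.05 level.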

(5) Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This will help you determine if the state in which a home is located has a significant impact on its price. All data should be used.

# Form a data frame
myData <- data.frame(Home$Price, Home$State)

# Perform one-way ANOVA
model2 <- aov(Home$Price ~ Home$State, data = myData)
summary(model2)
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## Home$State    3 1198169  399390   7.355 0.000148 ***
## Residuals   116 6299266   54304                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Conduct post-hoc Tukey's HSD test
TukeyHSD(model2)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Home$Price ~ Home$State, data = myData)
## 
## $`Home$State`
##             diff       lwr        upr     p adj
## NJ-CA -206.83333 -363.6729  -49.99379 0.0044754
## NY-CA -170.03333 -326.8729  -13.19379 0.0280402
## PA-CA -269.80000 -426.6395 -112.96045 0.0001011
## NY-NJ   36.80000 -120.0395  193.63955 0.9282064
## PA-NJ  -62.96667 -219.8062   93.87288 0.7224830
## PA-NY  -99.76667 -256.6062   57.07288 0.3505951
plot(model2, 1)

plot(model2, 2)

Since the p-value (0.000148) is far below 0.01, the data indicate that there are significant differences in mean home prices among the states.

To see which states differ in price, we conduct a post-hoc Tukey’s HSD test.

Not all of the adjusted p-values are small. The small adjusted p-values (0.0044754, 0.0280402, 0.0001011) mean that there is a significant difference in mean home prices for the pairs NJ-CA, NY-CA, and PA-CA; the negative differences show that California homes are priced higher on average than homes in each of the other three states. The larger adjusted p-values (0.9282064, 0.7224830, 0.3505951) mean that there is no significant difference in mean home prices for the pairs NY-NJ, PA-NJ, and PA-NY.

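The two diagnostic plots, residuals versus fitted values from plot(model2, 1) and the normal Q-Q plot from plot(model2, 2), check the ANOVA assumptions. About half of the points follow the expected pattern while the rest depart from it, so the assumptions appear to hold only approximately.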
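Reading the significant pairs off the Tukey table by eye can also be done programmatically. The sketch below refits the same one-way ANOVA with State as a factor and returns the pairs whose adjusted p-value falls below a chosen level. The function name significant_state_pairs and the alpha argument are hypothetical additions, not part of the original report.

```r
# Hypothetical helper (not part of the original report): one-way ANOVA of
# Price by State, followed by Tukey's HSD, returning the significant pairs.
significant_state_pairs <- function(homes, alpha = 0.05) {
  homes$State <- factor(homes$State)        # treat State as a grouping factor
  fit <- aov(Price ~ State, data = homes)
  tukey <- TukeyHSD(fit)$State              # matrix with columns diff, lwr, upr, p adj
  rownames(tukey)[tukey[, "p adj"] < alpha]
}

# Usage with the data frame loaded earlier in the report:
# significant_state_pairs(Home)
```

Since this fits the same model as the report, applying it to Home at the 0.05 level should return the three pairs involving CA listed in the Tukey output above.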

Summary Answers

(1) Use the data only for California. How much does the size of a home influence its price?

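Size and price are positively and significantly related (p-value 1.02e-05), but the relationship explains only about 15% of the variation (R-squared 0.1527). The slope estimate (1.1195) comes from a model with Size as the response, so it is the change in predicted size per one-unit increase in price, not the change in price per unit of size. The fit also uses all 120 homes rather than the California homes only.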

(2) Use the data only for California. How does the number of bedrooms of a home influence its price?

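The p-value (0.0285) associated with the slope coefficient indicates that the relationship between the number of bedrooms and price is statistically significant at the 0.05 level but not at the 0.01 level, and it explains only about 4% of the variation (R-squared 0.04).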

(3) Use the data only for California. How does the number of bathrooms of a home influence its price?

The p-value (2.99e-06) associated with the slope coefficient indicates that the relationship between the number of bathrooms and price is statistically significant at any commonly used significance level (say at level 0.01), and it explains about 17% of the variation (R-squared 0.1695).

(4) Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?

How can we interpret each of the coefficients and each p-value?

Holding the other two variables constant, the predicted price increases by about 0.082 price units per one-unit increase in Size, decreases by about 25.81 price units per additional bedroom, and increases by about 84.96 price units per additional bathroom.

Baths is significant at the 0.05 level (p-value 0.0152), Size is significant only at the 0.10 level (p-value 0.0570), and Beds is not significant (p-value 0.4334). The three variables are jointly significant (F-test p-value 1.329e-05) but explain only about 19.5% of the variation in price.

(5) Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This will help you determine if the state in which a home is located has a significant impact on its price. All data should be used.

Since the p-value (0.000148) is far below 0.01, the data indicate that there are significant differences in mean home prices among the states.

The post-hoc Tukey’s HSD test shows which states differ. The small adjusted p-values (0.0044754, 0.0280402, 0.0001011) mean that there is a significant difference in mean home prices for the pairs NJ-CA, NY-CA, and PA-CA, with California priced higher in each case. The larger adjusted p-values (0.9282064, 0.7224830, 0.3505951) mean that there is no significant difference for the pairs NY-NJ, PA-NJ, and PA-NY.

plot(model2, 1)

plot(model2, 2)

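In these two diagnostic plots (residuals versus fitted values and the normal Q-Q plot), about half of the points follow the expected pattern while the rest depart from it, so the ANOVA assumptions appear to hold only approximately.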

Conclusion

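Using linear regression and one-way ANOVA, we were able to find which variables are significantly related to home prices. Size and the number of bathrooms show the clearest relationships with price, the number of bedrooms a weaker one, and homes in California are priced significantly higher on average than homes in NJ, NY, and PA.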

Appendix

# Q1
sp = data.frame(Home$Size, Home$Price)
lm1_model <- lm(Home$Size ~ Home$Price, data = sp)
summary(lm1_model)

# Q2
bp = data.frame(Home$Beds, Home$Price)
lm2_model <- lm(Home$Beds ~ Home$Price, data = bp)
summary(lm2_model)

# Q3
bap = data.frame(Home$Baths, Home$Price)
lm3_model <- lm(Home$Baths ~ Home$Price, data = bap)
summary(lm3_model)

# Q4
data <- data.frame(Home$Size, Home$Beds, Home$Baths, Home$Price)
model <- lm(Home$Price ~ Home$Size + Home$Beds + Home$Baths, data = data)
summary(model)

# Q5
myData <- data.frame(Home$Price, Home$State)
model2 <- aov(Home$Price ~ Home$State, data = myData)
summary(model2)
TukeyHSD(model2)
plot(model2, 1)
plot(model2, 2)