1. Introduction

This is the third project of the STAT 353 course. In this project, we use the HomesForSale data from lock5stat to examine the relationship between home prices and other variables.

The following questions will be addressed in the project:

Use the data only for California. How much does the size of a home influence its price?
Use the data only for California. How does the number of bedrooms of a home influence its price?
Use the data only for California. How does the number of bathrooms in a home influence its price?
Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?
Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This will help you determine if the state in which a home is located has a significant impact on its price. All data should be used.

2. Data

According to Lock5DataGuide, this is the data on homes for sale in four states in 2019. There are 5 variables and 120 entries in the dataset. The 5 variables represent the state, the price, the size, the number of bedrooms, and the number of bathrooms. I also find it interesting that the data include 30 homes in each state.

3. Analysis

First, we will need to import the dataset by using the following R code:

# Store data into a variable home
home = read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")

Then we will address each question individually.

Q1. Use the data only for California. How much does the size of a home influence its price?

The null hypothesis states that there is no significant linear relationship between the size of the home and its price. This implies that the slope of the regression line is zero, indicating no association between the size of the home and its price.

The alternative hypothesis states that there is a significant linear relationship between the size of the home and its price. This implies that the slope of the regression line is not zero, indicating a non-zero association between the size of the home and its price.

We will perform a simple linear regression for this question.

model <- lm(Price ~ Size, data = home, subset = (State == "CA"))

summary(model)

## 
## Call:
## lm(formula = Price ~ Size, data = home, subset = (State == "CA"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462.55 -139.69   39.24  147.65  352.21 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -56.81675  154.68102  -0.367 0.716145    
## Size          0.33919    0.08558   3.963 0.000463 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared:  0.3594, Adjusted R-squared:  0.3365 
## F-statistic: 15.71 on 1 and 28 DF,  p-value: 0.0004634

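The slope estimate is 0.339. Because the price is recorded in thousands of dollars, this indicates that the price increases by about 0.339 * 1,000 dollars, or 339 dollars, for each additional square foot. The p-value for the slope (0.000463) is well below 0.05, so we reject the null hypothesis and conclude that there is a significant linear relationship between the size of a home and its price. Size explains about 36% of the variation in price (R-squared = 0.3594).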
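As a quick check on the summary output above, the t value and p-value for the slope can be reproduced from the reported estimate and standard error. This is only a sketch based on the rounded numbers printed in the output; the variable names are our own and are not part of the original analysis.

```r
# Values copied from the Q1 summary output (rounded as printed there)
estimate  <- 0.33919   # slope for Size
std_error <- 0.08558   # standard error of the slope
df_resid  <- 28        # residual degrees of freedom

# t statistic for the null hypothesis that the slope is zero
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# Slope converted from thousands of dollars to dollars per square foot
dollars_per_sqft <- estimate * 1000

print(round(t_value, 3))
print(signif(p_value, 3))
print(dollars_per_sqft)
```

Because the inputs are rounded, the results may differ from the summary output in the last digit.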

Q2. Use the data only for California. How does the number of bedrooms of a home influence its price?

We will also perform a simple linear regression for this question.

dataframe_ca <- subset(home, State == "CA")

model <- lm(Price ~ Beds, data = dataframe_ca)

summary(model)

## 
## Call:
## lm(formula = Price ~ Beds, data = dataframe_ca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -413.83 -236.62   29.94  197.69  570.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   269.76     233.62   1.155    0.258
## Beds           84.77      72.91   1.163    0.255
## 
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared:  0.04605,    Adjusted R-squared:  0.01198 
## F-statistic: 1.352 on 1 and 28 DF,  p-value: 0.2548

plot(dataframe_ca$Beds, dataframe_ca$Price)
abline(model, col="red", lwd=2)

The p-value of 0.2548 for the slope coefficient indicates that the relationship is not statistically significant at the 0.05 significance level. We fail to reject the null hypothesis: the data do not provide evidence of a linear relationship between the number of bedrooms and the price of the homes.
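Another way to look at this result is through a 95% confidence interval for the slope. The sketch below rebuilds the interval from the estimate and standard error printed in the Q2 output; the variable names are our own. If the slope is not significant at the 0.05 level, the interval should contain zero.

```r
# Values copied from the Q2 summary output (rounded as printed there)
estimate  <- 84.77   # slope for Beds, in thousands of dollars per bedroom
std_error <- 72.91   # standard error of the slope
df_resid  <- 28      # residual degrees of freedom

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Confidence interval: estimate +/- t_crit * standard error
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)

# TRUE when zero lies inside the interval
contains_zero <- unname(ci["lower"] < 0 && ci["upper"] > 0)

print(round(ci, 2))
print(contains_zero)
```

With the fitted model object available, confint(model, "Beds", level = 0.95) should give the same interval without the rounding.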

Q3. Use the data only for California. How does the number of bathrooms of a home influence its price?

We will also perform a simple linear regression for this question.

dataframe_ca <- subset(home, State == "CA")

model <- lm(Price ~ Baths, data = dataframe_ca)

summary(model)

## 
## Call:
## lm(formula = Price ~ Baths, data = dataframe_ca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -374.93 -181.56   -2.74  152.31  614.81 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    90.71     148.57   0.611  0.54641   
## Baths         194.74      62.28   3.127  0.00409 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared:  0.2588, Adjusted R-squared:  0.2324 
## F-statistic: 9.779 on 1 and 28 DF,  p-value: 0.004092

plot(dataframe_ca$Baths, dataframe_ca$Price)
abline(model, col="red", lwd=2)

The p-value of 0.00409 for the slope coefficient indicates that the relationship is statistically significant, even at the 0.01 significance level. This means that the number of bathrooms has a significant impact on the price of the homes: each additional bathroom is associated with an increase of about 194.74 thousand dollars in price.
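To make the fitted line more concrete, the sketch below plugs a few bathroom counts into the estimated equation from the Q3 output. The helper name predict_price and the chosen bathroom counts are hypothetical and only for illustration; counts outside the range observed in the data would be extrapolation.

```r
# Coefficients copied from the Q3 summary output (rounded as printed there)
intercept <- 90.71    # in thousands of dollars
slope     <- 194.74   # thousands of dollars per bathroom

# Hypothetical helper: predicted price (in thousands of dollars)
predict_price <- function(baths) {
  intercept + slope * baths
}

# Illustrative bathroom counts
baths <- 1:4
predicted <- predict_price(baths)

print(data.frame(Baths = baths, PredictedPrice = predicted))
```

With the fitted model object available, predict(model, newdata = data.frame(Baths = 1:4)) would do the same calculation with unrounded coefficients.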

Q4. Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?

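We will perform a multiple regression for this question. The model fitted here uses the number of bedrooms and the number of bathrooms as predictors; the size of the home is not included.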

model <- lm(Price ~ Beds + Baths, data = home, subset = (State == "CA"))

summary(model)

## 
## Call:
## lm(formula = Price ~ Beds + Baths, data = home, subset = (State == 
##     "CA"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -368.75 -182.03   -6.16  152.92  615.42 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    52.03     223.48   0.233  0.81767   
## Beds           16.40      69.80   0.235  0.81598   
## Baths         189.17      67.64   2.797  0.00939 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 239.9 on 27 degrees of freedom
## Multiple R-squared:  0.2604, Adjusted R-squared:  0.2056 
## F-statistic: 4.752 on 2 and 27 DF,  p-value: 0.01705

The p-value for the number of bedrooms is 0.81598, which means that using both the number of bedrooms and the number of bathrooms is not significantly better than using the number of bathrooms alone to predict the price of the homes.

The p-value for the number of bathrooms is 0.00939, which means that using both the number of bedrooms and the number of bathrooms is significantly better than using the number of bedrooms alone to predict the price of the homes.

The p-value for the overall model with both the number of bedrooms and the number of bathrooms is 0.01705. This indicates that the model is statistically significant at the 0.05 significance level but not at the 0.01 significance level. However, adding the number of bedrooms is not really necessary: the adjusted R-squared (0.2056) is lower than that of the bathrooms-only model in Q3 (0.2324), so it is simpler to use the number of bathrooms alone to predict prices.
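The model above uses only Beds and Baths, while the question also asks about Size. A minimal sketch of how the three-predictor model could be fitted and compared with the two-predictor model is given below. The helper name compare_price_models is hypothetical, and no output is shown because this model was not run in the report.

```r
# Hypothetical helper: fit the Beds + Baths model and a model that also
# includes Size, then compare the two nested models.
compare_price_models <- function(ca) {
  # Model fitted in the report
  reduced <- lm(Price ~ Beds + Baths, data = ca)
  # Model with Size added as a third predictor
  full <- lm(Price ~ Size + Beds + Baths, data = ca)
  list(
    reduced = reduced,
    full = full,
    # Partial F-test: does adding Size improve on Beds + Baths?
    comparison = anova(reduced, full)
  )
}

# Usage with the report's data (assumes `home` has been loaded as above):
# result <- compare_price_models(subset(home, State == "CA"))
# summary(result$full)
# result$comparison
```

The summary of the full model would show each predictor's contribution after adjusting for the other two, which is what the question about joint influence asks for.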

Q5. Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This will help you determine if the state in which a home is located has a significant impact on its price. All data should be used.

We will use a one-way ANOVA for this question, followed by Tukey's HSD to compare pairs of states.

model <- aov(Price ~ State, data = home)

summary(model)

##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## State         3 1198169  399390   7.355 0.000148 ***
## Residuals   116 6299266   54304                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(model)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Price ~ State, data = home)
## 
## $State
##             diff       lwr        upr     p adj
## NJ-CA -206.83333 -363.6729  -49.99379 0.0044754
## NY-CA -170.03333 -326.8729  -13.19379 0.0280402
## PA-CA -269.80000 -426.6395 -112.96045 0.0001011
## NY-NJ   36.80000 -120.0395  193.63955 0.9282064
## PA-NJ  -62.96667 -219.8062   93.87288 0.7224830
## PA-NY  -99.76667 -256.6062   57.07288 0.3505951

boxplot(home$Price ~ home$State)

The p-value is 0.000148, which is well below the common significance levels (even 0.01). The data indicate that the mean home prices differ significantly among the four states.

Using the post-hoc Tukey’s HSD, we can also see that the adjusted p-values for California compared to New Jersey, New York, and Pennsylvania are 0.0044754, 0.0280402, and 0.0001011, respectively. All three are below 0.05, which indicates a significant difference between home prices in California and each of the other states. In contrast, none of the comparisons among New Jersey, New York, and Pennsylvania is significant (adjusted p-values 0.9282064, 0.7224830, and 0.3505951). As we can see on the boxplot, the homes that are located in California generally have higher prices compared to the other states.
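The F value and p-value in the ANOVA table can be reproduced from the mean squares and degrees of freedom printed there. The sketch below uses the rounded values from the table, with variable names of our own choosing.

```r
# Values copied from the ANOVA table (rounded as printed there)
ms_state <- 399390   # mean square between states
ms_resid <- 54304    # mean square of the residuals
df_state <- 3        # 4 states - 1
df_resid <- 116      # residual degrees of freedom

# F statistic: between-state variation relative to within-state variation
f_value <- ms_state / ms_resid

# Upper-tail p-value from the F distribution
p_value <- pf(f_value, df_state, df_resid, lower.tail = FALSE)

# Total number of homes implied by the degrees of freedom
n_total <- df_state + df_resid + 1

print(round(f_value, 3))
print(signif(p_value, 3))
print(n_total)
```

The total of 120 homes implied by the degrees of freedom agrees with the size of the dataset.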

4. Summary

Overall, this was an interesting project because it allowed us to use regression models to examine the relationship between two variables in a dataset and how strong that relationship is. What surprised me most was how the home price relates to the number of bedrooms and bathrooms. Although the number of bedrooms has a positive linear relationship with price, the relationship is weak and not statistically significant. I did not expect the number of bathrooms to have a much greater impact on home prices than the number of bedrooms, but the data and the regression models say otherwise. As expected, home prices in California are significantly higher than in the other three states.

5. Appendix


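# Store data into a variable home
home = read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")

# Q1 code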
model <- lm(Price ~ Size, data = home, subset = (State == "CA"))
summary(model)

# Q2 code
dataframe_ca <- subset(home, State == "CA")
model <- lm(Price ~ Beds, data = dataframe_ca)
summary(model)

plot(dataframe_ca$Beds, dataframe_ca$Price)
abline(model, col="red", lwd=2)

# Q3 code
dataframe_ca <- subset(home, State == "CA")
model <- lm(Price ~ Baths, data = dataframe_ca)
summary(model)

plot(dataframe_ca$Baths, dataframe_ca$Price)
abline(model, col="red", lwd=2)

# Q4 code
model <- lm(Price ~ Beds + Baths, data = home, subset = (State == "CA"))
summary(model)

# Q5 code
model <- aov(Price ~ State, data = home)
summary(model)

TukeyHSD(model)
boxplot(home$Price ~ home$State)

Project 3: Exploring Homes in CA, NJ, NY, and PA using Regression Model

Minh Quan Tran

2025-04-29