Introduction

The “HomesForSale” dataset provides detailed information on homes for sale in four U.S. states: California (CA), New Jersey (NJ), New York (NY), and Pennsylvania (PA). It contains 120 observations, capturing key attributes of homes that were on the market in 2019. This report addresses the following five questions:

  1. Use the data only for California. How much does the size of a home influence its price?

  2. Use the data only for California. How does the number of bedrooms of a home influence its price?

  3. Use the data only for California. How does the number of bathrooms of a home influence its price?

  4. Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?

  5. Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This will help you determine if the state in which a home is located has a significant impact on its price. All data should be used.

Analysis

We will explore the questions in detail.

library(ggplot2)  # needed for the ggplot() calls below
home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
head(home)
##   State Price Size Beds Baths
## 1    CA   533 1589    3   2.5
## 2    CA   610 2008    3   2.0
## 3    CA   899 2380    5   3.0
## 4    CA   929 1868    3   3.0
## 5    CA   210 1360    2   2.0
## 6    CA   268 2131    3   2.0

Q1. Use the data only for California. How much does the size of a home influence its price?

california_data <- subset(home, State == "CA")
size_model <- lm(Price ~ Size, data = california_data)
summary(size_model)
## 
## Call:
## lm(formula = Price ~ Size, data = california_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462.55 -139.69   39.24  147.65  352.21 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -56.81675  154.68102  -0.367 0.716145    
## Size          0.33919    0.08558   3.963 0.000463 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared:  0.3594, Adjusted R-squared:  0.3365 
## F-statistic: 15.71 on 1 and 28 DF,  p-value: 0.0004634
ggplot(california_data, aes(x = Size, y = Price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Size Influence on Price in California",
       x = "Size (SqFt)",
       y = "Price ($)")
## `geom_smooth()` using formula = 'y ~ x'
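
The slope estimate alone does not show how precisely the effect of size is measured. The sketch below builds a 95% confidence interval for the Size coefficient from the estimate, standard error, and residual degrees of freedom printed in the summary above. The variable names are illustrative and not part of the original analysis. Calling `confint(size_model)` on the fitted model should give the same interval up to rounding.

```r
# Values copied from the summary(size_model) output above.
slope    <- 0.33919   # estimated coefficient for Size
se_slope <- 0.08558   # standard error of that coefficient
df_resid <- 28        # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df = df_resid)
ci_95  <- slope + c(-1, 1) * t_crit * se_slope
print(round(ci_95, 4))

# The same estimate and interval for a 100-unit increase in Size.
per_100 <- 100 * c(estimate = slope, lower = ci_95[1], upper = ci_95[2])
print(round(per_100, 2))
```

Because the interval excludes zero, it agrees with the small p-value reported for Size.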

Q2. Use the data only for California. How does the number of bedrooms of a home influence its price?

bedrooms_model <- lm(Price ~ Beds, data = california_data)
summary(bedrooms_model)
## 
## Call:
## lm(formula = Price ~ Beds, data = california_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -413.83 -236.62   29.94  197.69  570.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   269.76     233.62   1.155    0.258
## Beds           84.77      72.91   1.163    0.255
## 
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared:  0.04605,    Adjusted R-squared:  0.01198 
## F-statistic: 1.352 on 1 and 28 DF,  p-value: 0.2548
ggplot(california_data, aes(x = Beds, y = Price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Bedrooms Influence on Price in California",
       x = "Number of Bedrooms",
       y = "Price ($)")
## `geom_smooth()` using formula = 'y ~ x'

Q3. Use the data only for California. How does the number of bathrooms of a home influence its price?

bathrooms_model <- lm(Price ~ Baths, data = california_data)
summary(bathrooms_model)
## 
## Call:
## lm(formula = Price ~ Baths, data = california_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -374.93 -181.56   -2.74  152.31  614.81 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    90.71     148.57   0.611  0.54641   
## Baths         194.74      62.28   3.127  0.00409 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared:  0.2588, Adjusted R-squared:  0.2324 
## F-statistic: 9.779 on 1 and 28 DF,  p-value: 0.004092
ggplot(california_data, aes(x = Baths, y = Price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Bathrooms Influence on Price in California",
       x = "Number of Bathrooms",
       y = "Price ($)")
## `geom_smooth()` using formula = 'y ~ x'

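The scatter plot and fitted line show a positive association between the number of bathrooms and price. The estimated slope is 194.74 (p = 0.00409), so each additional bathroom is associated with an increase of about 194.74 in Price, measured in the units recorded in the dataset. On its own, the number of bathrooms explains about 26% of the variation in California home prices (R-squared = 0.2588).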
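
To make the bathrooms fit concrete, the sketch below evaluates the fitted line at two and three bathrooms, using the coefficients printed by `summary(bathrooms_model)`. The helper `predict_price` is hypothetical and only wraps the rounded coefficients. With the fitted model available, `predict(bathrooms_model, newdata = data.frame(Baths = c(2, 3)))` should agree up to rounding.

```r
# Coefficients copied from the summary(bathrooms_model) output above.
intercept <- 90.71
slope     <- 194.74

# Hypothetical helper: fitted Price for a given number of bathrooms.
predict_price <- function(baths) intercept + slope * baths

baths <- c(2, 3)
fitted_prices <- predict_price(baths)
print(data.frame(Baths = baths, FittedPrice = fitted_prices))
```

The difference between the two fitted values equals the slope, which is the sense in which the slope measures the change in price per additional bathroom.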

Q4. Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?

multiple_model <- lm(Price ~ Size + Beds + Baths, data = california_data)
summary(multiple_model)
## 
## Call:
## lm(formula = Price ~ Size + Beds + Baths, data = california_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -415.47 -130.32   19.64  154.79  384.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -41.5608   210.3809  -0.198   0.8449  
## Size          0.2811     0.1189   2.364   0.0259 *
## Beds        -33.7036    67.9255  -0.496   0.6239  
## Baths        83.9844    76.7530   1.094   0.2839  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221.8 on 26 degrees of freedom
## Multiple R-squared:  0.3912, Adjusted R-squared:  0.3209 
## F-statistic: 5.568 on 3 and 26 DF,  p-value: 0.004353
ggplot(california_data, aes(x = as.factor(Baths), y = Price)) +
  geom_boxplot() +
  labs(title = "Bathrooms Influence on Price in California",
       x = "Number of Bathrooms",
       y = "Price ($)") +
  theme_minimal()

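In the joint model, only Size remains statistically significant at the 5% level (estimate 0.2811, p = 0.0259). Beds (estimate -33.70, p = 0.6239) and Baths (estimate 83.98, p = 0.2839) are not significant once Size is accounted for. The model as a whole is significant (F = 5.568 on 3 and 26 DF, p = 0.004353) and explains about 39% of the variation in price (R-squared = 0.3912), only slightly more than the Size-only model (R-squared = 0.3594).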
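
One way to check whether Beds and Baths add anything beyond Size is a partial F-test comparing the Size-only model with the three-predictor model. The sketch below computes that test from the R-squared values printed in the two summaries, so it is accurate only to the rounding of those values. The variable names are illustrative. With both fitted models available, `anova(size_model, multiple_model)` performs the same comparison.

```r
# R-squared values copied from the model summaries in this report.
r2_reduced <- 0.3594  # Price ~ Size
r2_full    <- 0.3912  # Price ~ Size + Beds + Baths
q          <- 2       # number of added predictors (Beds and Baths)
df_full    <- 26      # residual degrees of freedom of the full model

# Partial F statistic for the q added predictors.
f_stat  <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_full)
p_value <- pf(f_stat, df1 = q, df2 = df_full, lower.tail = FALSE)

print(round(c(F = f_stat, p = p_value), 3))
```

A p-value well above 0.05 here would indicate that Beds and Baths do not significantly improve the fit once Size is in the model.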

Q5. Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This will help you determine if the state in which a home is located has a significant impact on its price. All data should be used.

state_model <- lm(Price ~ State, data = home)
anova(state_model)
## Analysis of Variance Table
## 
## Response: Price
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## State       3 1198169  399390  7.3547 0.0001482 ***
## Residuals 116 6299266   54304                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Diagnostic plots for the multiple regression model from Q4
par(mfrow = c(2, 2))
plot(multiple_model)

# Boxplot showing Price distribution among the four states
ggplot(home, aes(x = State, y = Price)) +
  geom_boxplot() +
  labs(title = "Differences in Home Prices among States",
       x = "State",
       y = "Price ($)")
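
The ANOVA table for Q5 reports a significant F-test (p = 0.0001482), which indicates that mean prices differ among the four states, but it does not show how much of the price variation State accounts for. The sketch below reproduces the F statistic from the sums of squares in that table and adds eta-squared as an effect size. Eta-squared is an addition for illustration and was not part of the original analysis. The variable names are also illustrative.

```r
# Values copied from the anova(state_model) table above.
ss_state <- 1198169   # sum of squares for State
ss_resid <- 6299266   # residual sum of squares
df_state <- 3
df_resid <- 116

# F statistic: ratio of the two mean squares.
f_stat  <- (ss_state / df_state) / (ss_resid / df_resid)
p_value <- pf(f_stat, df1 = df_state, df2 = df_resid, lower.tail = FALSE)

# Eta-squared: share of total variation in Price attributed to State.
eta_sq <- ss_state / (ss_state + ss_resid)

print(round(c(F = f_stat, eta_squared = eta_sq), 4))
print(signif(p_value, 4))
```

A significant overall test does not identify which states differ. A pairwise follow-up such as `TukeyHSD(aov(Price ~ State, data = home))` is one common option, assuming the full dataset is loaded as `home`.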

Appendix

The following R code was used for data analysis and visualization for the different questions in this report.

library(ggplot2)
home <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")

california_data <- subset(home, State == "CA")
size_model <- lm(Price ~ Size, data = california_data)
summary(size_model)
# Scatter plot for Size vs. Price in California
ggplot(california_data, aes(x = Size, y = Price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Size Influence on Price in California",
       x = "Size (SqFt)",
       y = "Price ($)")

bedrooms_model <- lm(Price ~ Beds, data = california_data)
summary(bedrooms_model)
# Scatter plot for Bedrooms vs. Price in California
ggplot(california_data, aes(x = Beds, y = Price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Bedrooms Influence on Price in California",
       x = "Number of Bedrooms",
       y = "Price ($)")

bathrooms_model <- lm(Price ~ Baths, data = california_data)
summary(bathrooms_model)
# Scatter plot for Bathrooms vs. Price in California
ggplot(california_data, aes(x = Baths, y = Price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Bathrooms Influence on Price in California",
       x = "Number of Bathrooms",
       y = "Price ($)")

multiple_model <- lm(Price ~ Size + Beds + Baths, data = california_data)
summary(multiple_model)
# Boxplot for Baths vs. Price in California
ggplot(california_data, aes(x = as.factor(Baths), y = Price)) +
  geom_boxplot() +
  labs(title = "Bathrooms Influence on Price in California",
       x = "Number of Bathrooms",
       y = "Price ($)") +
  theme_minimal()

state_model <- lm(Price ~ State, data = home)
anova(state_model)
# Diagnostic plots for multiple regression
par(mfrow = c(2, 2))
plot(multiple_model)


# Boxplot showing Price distribution among the four states
ggplot(home, aes(x = State, y = Price)) +
  geom_boxplot() +
  labs(title = "Differences in Home Prices among States",
       x = "State",
       y = "Price ($)")