project01

Author

Hangu Lee

1. Data Exploration

# Load the dataset
homeprice <- read.csv("homeprice.csv")

# Visualization: Histogram of Sale Price
hist(homeprice$sale, 
     main="Distribution of Sale Prices", 
     xlab="Sale Price", 
     col="lightblue")

# Check all relationships using scatterplots or boxplots
plot(sale ~ list, data=homeprice, main="Sale Price vs List Price")

plot(sale ~ bedrooms, data=homeprice, main="Sale Price vs Bedrooms")

plot(sale ~ full, data=homeprice, main="Sale Price vs Full Bathrooms")

plot(sale ~ half, data=homeprice, main="Sale Price vs Half Bathrooms")

plot(sale ~ rooms, data=homeprice, main="Sale Price vs Number of Rooms")

plot(sale ~ factor(neighborhood), data=homeprice, main="Sale Price by Neighborhood Rank")

In summary, list price and neighborhood rank show the strongest and clearest relationships with sale price. The number of rooms shows a moderate positive association, while bedrooms, full bathrooms, and half bathrooms show much weaker or non-linear associations with sale price.
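The visual impressions above could be backed by a numeric summary. The following is a minimal sketch, assuming a data frame with the same column names as homeprice. The `toy` data frame is simulated purely for illustration and is not the project data.

```r
# Simulated stand-in for homeprice (hypothetical values, NOT the real data)
set.seed(1)
n <- 29
toy <- data.frame(
  list     = round(runif(n, 100, 500)),
  bedrooms = sample(2:5, n, replace = TRUE),
  rooms    = sample(5:10, n, replace = TRUE)
)
# In this toy example, sale price is list price plus noise
toy$sale <- toy$list + rnorm(n, sd = 10)

# Pearson correlation of each numeric predictor with sale price
predictors <- c("list", "bedrooms", "rooms")
cors <- sapply(toy[predictors], function(x) cor(x, toy$sale))

# Show the predictors ordered from strongest to weakest positive correlation
print(round(sort(cors, decreasing = TRUE), 3))
```

With the real data, `toy` would be replaced by `homeprice` and `predictors` extended to include `full` and `half`. Because neighborhood is a rank, `cor(..., method = "spearman")` may be a more suitable choice for that variable.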

2. First Model: Sale Price Regression

# Build a multiple linear regression model to explain sale price
model_sale <- lm(sale ~ list + full + half + bedrooms + rooms + factor(neighborhood), data=homeprice)

# Check the coefficients and goodness-of-fit
summary(model_sale)

Call:
lm(formula = sale ~ list + full + half + bedrooms + rooms + factor(neighborhood), 
    data = homeprice)

Residuals:
    Min      1Q  Median      3Q     Max 
-24.149  -6.679   1.486   4.364  24.149 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            27.96941   15.84528   1.765   0.0936 .  
list                    0.91554    0.07965  11.494 5.34e-10 ***
full                   -0.04073    5.49422  -0.007   0.9942    
half                    5.07142    6.08110   0.834   0.4147    
bedrooms               -4.39758    6.64979  -0.661   0.5164    
rooms                   2.44383    3.66230   0.667   0.5126    
factor(neighborhood)2 -26.11476   12.69888  -2.056   0.0537 .  
factor(neighborhood)3 -11.42126   14.32365  -0.797   0.4351    
factor(neighborhood)4  -4.41077   20.66629  -0.213   0.8333    
factor(neighborhood)5  -4.39556   30.46718  -0.144   0.8868    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.86 on 19 degrees of freedom
Multiple R-squared:  0.9919,    Adjusted R-squared:  0.988 
F-statistic: 257.1 on 9 and 19 DF,  p-value: < 2.2e-16

# Identify which variable appears to have the greatest effect on sale price
anova(model_sale)

Analysis of Variance Table

Response: sale
                     Df Sum Sq Mean Sq   F value Pr(>F)    
list                  1 381050  381050 2305.6632 <2e-16 ***
full                  1    156     156    0.9443 0.3434    
half                  1     21      21    0.1271 0.7254    
bedrooms              1     25      25    0.1529 0.7001    
rooms                 1      3       3    0.0164 0.8994    
factor(neighborhood)  4   1107     277    1.6748 0.1973    
Residuals            19   3140     165                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Check the distribution of residuals
hist(residuals(model_sale), 
     main="Histogram of Residuals (Sale Model)", 
     xlab="Residuals", 
     col="salmon")

# Residuals vs Fitted plot for model diagnostics
plot(model_sale, which=1)

The multiple linear regression model for sale price fits very well (R² = 0.9919, adjusted R² = 0.988). Both the coefficient estimates and the ANOVA table identify list price as the strongest predictor, and it is the only one significant at the 5% level (t = 11.49, p < 0.001). After controlling for list price, bathrooms, bedrooms, rooms, and neighborhood rank add little explanatory power: none of their coefficients reaches the 5% significance level, and the neighborhood factor as a whole is not significant in the ANOVA table (F = 1.67, p = 0.197). Diagnostic plots suggest that the residuals are approximately normally distributed and do not exhibit severe violations of the homoscedasticity assumption, although a few potential outliers and slight departures from linearity are visible at higher fitted values. Overall, the model appears to describe the data well, largely because list price alone accounts for almost all of the variation in sale price.
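Whether the remaining predictors add anything once list price is in the model can be checked with a partial F-test on nested models. The sketch below assumes the same column names as homeprice but runs on simulated `toy` data (hypothetical values, not the project data), and it includes only a subset of the predictors to keep the example short.

```r
# Simulated stand-in for homeprice (hypothetical values, NOT the real data)
set.seed(2)
n <- 29
toy <- data.frame(
  list         = round(runif(n, 100, 500)),
  bedrooms     = sample(2:5, n, replace = TRUE),
  neighborhood = rep(1:5, length.out = n)
)
# In this toy example, sale price depends on list price only
toy$sale <- 0.95 * toy$list + rnorm(n, sd = 12)

# Nested models: list price alone vs. list price plus the other terms
reduced <- lm(sale ~ list, data = toy)
full    <- lm(sale ~ list + bedrooms + factor(neighborhood), data = toy)

# Partial F-test: do the extra terms reduce residual variance beyond list alone?
cmp <- anova(reduced, full)
print(cmp)
```

Applied to the real data, the reduced model would be `lm(sale ~ list, data = homeprice)` and the full model would be `model_sale`. The rounded sums of squares in the ANOVA table above (156 + 21 + 25 + 3 + 1107 = 1312 on 8 degrees of freedom, against a residual mean square of about 165) suggest that this test would give an F value close to 1.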

3. Second Model: List Price Regression

# Build a second model to explain list price
model_list <- lm(list ~ full + half + bedrooms + rooms + factor(neighborhood), data=homeprice)

# Check the coefficients and goodness-of-fit for the list price model
summary(model_list)

Call:
lm(formula = list ~ full + half + bedrooms + rooms + factor(neighborhood), 
    data = homeprice)

Residuals:
    Min      1Q  Median      3Q     Max 
-50.506 -24.097  -0.703  22.762  54.684 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -10.294     44.424  -0.232 0.819113    
full                    29.280     13.966   2.097 0.048950 *  
half                    50.090     12.884   3.888 0.000914 ***
bedrooms                10.500     18.520   0.567 0.577065    
rooms                    9.225     10.072   0.916 0.370624    
factor(neighborhood)2   25.163     35.203   0.715 0.483005    
factor(neighborhood)3  100.995     33.271   3.036 0.006531 ** 
factor(neighborhood)4  188.939     39.763   4.752 0.000122 ***
factor(neighborhood)5  300.699     52.865   5.688 1.44e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 36.09 on 20 degrees of freedom
Multiple R-squared:  0.9358,    Adjusted R-squared:  0.9102 
F-statistic: 36.47 on 8 and 20 DF,  p-value: 2.82e-10

# Identify which variable appears to have the greatest effect on list price
anova(model_list)

Analysis of Variance Table

Response: list
                     Df Sum Sq Mean Sq  F value    Pr(>F)    
full                  1 169594  169594 130.2069 3.292e-10 ***
half                  1  92249   92249  70.8248 5.283e-08 ***
bedrooms              1   9745    9745   7.4815   0.01275 *  
rooms                 1  10162   10162   7.8021   0.01122 *  
factor(neighborhood)  4  98264   24566  18.8607 1.454e-06 ***
Residuals            20  26050    1302                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Check the distribution of residuals
hist(residuals(model_list), 
     main="Histogram of Residuals (List Model)", 
     xlab="Residuals", 
     col="lightgreen")

# Residuals vs Fitted plot for model diagnostics
plot(model_list, which=1)

A second multiple regression model uses the same explanatory variables, other than list price itself, to predict list price. The model fits well (R² = 0.9358, adjusted R² = 0.9102) and is highly significant overall (F = 36.47, p < 0.001). In the ANOVA table, the number of full bathrooms has the largest F value (130.21), followed by half bathrooms and neighborhood rank. Because anova() reports sequential sums of squares, however, these F values depend on the order in which terms enter the model, and full bathrooms is entered first. In the coefficient table, which adjusts each term for all the others, the strongest effects belong to the upper neighborhood ranks (ranks 3 to 5, all with p < 0.01) and half bathrooms (p < 0.001), while full bathrooms is only marginally significant (p = 0.049). This differs from the sale price model, where list price was by far the most important predictor. The results suggest that bathroom counts and neighborhood quality are the main drivers of the initial listing price, so a real estate agent should weigh these features when setting a list price and emphasize them when marketing a property.
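The order dependence of sequential ANOVA can be illustrated on a small example. The sketch below uses simulated `toy` data (hypothetical values, not the project data) in which two predictors are correlated, and compares `anova()` with `drop1()`, which tests each term after adjusting for all the others.

```r
# Simulated stand-in with two correlated predictors (NOT the real data)
set.seed(3)
n <- 29
full_baths <- sample(1:3, n, replace = TRUE)
rooms <- full_baths * 2 + sample(3:5, n, replace = TRUE)  # correlated with full_baths
toy <- data.frame(full = full_baths, rooms = rooms)
# In this toy example, list price truly depends on rooms only
toy$list <- 50 + 20 * toy$rooms + rnorm(n, sd = 15)

# Same model, terms entered in a different order
fit_a <- lm(list ~ full + rooms, data = toy)
fit_b <- lm(list ~ rooms + full, data = toy)

# Sequential (Type I) sum of squares for `full` depends on its position
ss_first <- anova(fit_a)["full", "Sum Sq"]
ss_last  <- anova(fit_b)["full", "Sum Sq"]
print(c(full_entered_first = ss_first, full_entered_last = ss_last))

# drop1() tests each term given all the others, so order does not matter
d_a <- drop1(fit_a, test = "F")
d_b <- drop1(fit_b, test = "F")
print(d_a)
```

For the real model, `drop1(model_list, test = "F")` would give an order-independent F test for each term, including a single test for the neighborhood factor as a whole.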

4. Neighborhood Effect on Price Difference

# Calculate the difference between sale price and list price
homeprice$price_diff <- homeprice$sale - homeprice$list

# Visualize the effect of neighborhood on the price difference
boxplot(price_diff ~ factor(neighborhood), data=homeprice, 
        main="Price Difference (Sale - List) by Neighborhood",
        xlab="Neighborhood Rank (1:Poor, 5:Rich)", 
        ylab="Price Difference (Sale - List)", 
        col="lightyellow")

# Check if the effect is statistically significant
model_diff <- lm(price_diff ~ factor(neighborhood), data=homeprice)
summary(model_diff)

Call:
lm(formula = price_diff ~ factor(neighborhood), data = homeprice)

Residuals:
    Min      1Q  Median      3Q     Max 
-26.000  -4.875   0.150   5.125  26.000 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)  
(Intercept)             18.650      8.626   2.162   0.0408 *
factor(neighborhood)2  -24.800      9.644  -2.572   0.0167 *
factor(neighborhood)3  -17.775      9.317  -1.908   0.0684 .
factor(neighborhood)4  -21.250     10.206  -2.082   0.0482 *
factor(neighborhood)5  -30.650     12.199  -2.513   0.0191 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.2 on 24 degrees of freedom
Multiple R-squared:  0.2636,    Adjusted R-squared:  0.1409 
F-statistic: 2.148 on 4 and 24 DF,  p-value: 0.1059

The model suggests that homes outside the lowest-ranked neighborhood sell for less relative to their list price. Homes in Neighborhood 1 sold on average 18.65 above list price (the intercept), and every other coefficient is negative. For Neighborhood 5, the coefficient of -30.65 implies an average sale price about 12.0 below list price (18.65 - 30.65), indicating that homes in the richest neighborhoods tend to sell for less than their asking price. The boxplot shows a similar pattern, with poorer neighborhoods generally exhibiting positive sale-list differences and wealthier neighborhoods showing negative differences. However, the overall regression is not statistically significant at the 5% level (F = 2.148, p = 0.106), even though several individual contrasts with Neighborhood 1 have p < 0.05, so the evidence is not strong enough to conclude that neighborhood rank alone reliably predicts the difference between sale price and list price. Therefore, the data do not support the claim that richer neighborhoods are more likely to have homes sell above the asking price.
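With R's default treatment coding, the intercept is the mean of the baseline level and each remaining coefficient is an offset from it, so group means can be recovered from the coefficients. The following sketch checks this on simulated `toy` data (hypothetical values, not the project data).

```r
# Simulated stand-in for the price difference by neighborhood (NOT the real data)
set.seed(4)
toy <- data.frame(neighborhood = rep(1:5, length.out = 29))
toy$price_diff <- rnorm(29, mean = 10 - 5 * toy$neighborhood, sd = 12)

# One-factor model with neighborhood 1 as the baseline level
fit <- lm(price_diff ~ factor(neighborhood), data = toy)
b <- coef(fit)

# Intercept = mean of the baseline level; other coefficients are offsets from it
means_from_coef <- unname(b[1] + c(0, b[-1]))

# Group means computed directly, for comparison
means_direct <- as.vector(tapply(toy$price_diff, toy$neighborhood, mean))

print(round(rbind(means_from_coef, means_direct), 2))
```

The same calculation on `coef(model_diff)` would give the average sale-list difference for each neighborhood rank in the real data.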