Final_Project_Part 1

Author

Cienna Kim

Published

June 18, 2026

Introduction

The dataset homeprice.csv contains information on houses sold in a New Jersey town during 2001. Variables include sale price, list price, number of bathrooms, bedrooms, rooms, and neighborhood rank. The goal of this analysis is to identify which house characteristics are most strongly related to sale price and list price, and to investigate whether neighborhood influences the difference between sale price and list price.

Data Exploration

First, the structure and summary statistics of the dataset were examined.

homeprice <- read.csv("homeprice.csv")

str(homeprice)
'data.frame':   29 obs. of  7 variables:
 $ list        : num  80 151 310 295 339 ...
 $ sale        : num  118 151 300 275 340 ...
 $ full        : int  1 1 2 2 2 1 3 1 1 1 ...
 $ half        : int  0 0 1 1 0 1 0 1 2 0 ...
 $ bedrooms    : int  3 4 4 4 3 4 3 3 3 1 ...
 $ rooms       : int  6 7 9 8 7 8 7 7 7 3 ...
 $ neighborhood: int  1 1 3 3 4 3 2 2 3 2 ...

summary(homeprice)
      list            sale            full            half       
 Min.   : 43.0   Min.   : 48.0   Min.   :1.000   Min.   :0.0000  
 1st Qu.:189.0   1st Qu.:185.0   1st Qu.:1.000   1st Qu.:0.0000  
 Median :275.0   Median :272.5   Median :2.000   Median :1.0000  
 Mean   :274.8   Mean   :273.5   Mean   :1.724   Mean   :0.6552  
 3rd Qu.:339.0   3rd Qu.:340.0   3rd Qu.:2.000   3rd Qu.:1.0000  
 Max.   :599.0   Max.   :613.0   Max.   :3.000   Max.   :2.0000  
    bedrooms         rooms         neighborhood  
 Min.   :1.000   Min.   : 3.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.: 7.000   1st Qu.:2.000  
 Median :3.000   Median : 7.000   Median :3.000  
 Mean   :3.172   Mean   : 7.207   Mean   :2.897  
 3rd Qu.:4.000   3rd Qu.: 8.000   3rd Qu.:3.000  
 Max.   :5.000   Max.   :11.000   Max.   :5.000  

Relationship Between Sale Price and Other Variables

Scatterplots were used to examine the relationship between sale and the numerical variables.

plot(homeprice$list, homeprice$sale,
     xlab = "List Price",
     ylab = "Sale Price")

plot(homeprice$full, homeprice$sale,
     xlab = "Full Bathrooms",
     ylab = "Sale Price")

plot(homeprice$half, homeprice$sale,
     xlab = "Half Bathrooms",
     ylab = "Sale Price")

plot(homeprice$bedrooms, homeprice$sale,
     xlab = "Bedrooms",
     ylab = "Sale Price")

plot(homeprice$rooms, homeprice$sale,
     xlab = "Rooms",
     ylab = "Sale Price")

A boxplot was used to investigate the effect of neighborhood rank on sale price.

boxplot(sale ~ factor(neighborhood),
        data = homeprice,
        xlab = "Neighborhood",
        ylab = "Sale Price")

Histograms were also examined to evaluate the distributions of sale and list prices.

hist(homeprice$sale,
     main = "Sale Price",
     xlab = "Sale Price")

hist(homeprice$list,
     main = "List Price",
     xlab = "List Price")

Findings from Exploratory Analysis

The strongest relationship with sale price was observed for list price. The scatterplot showed a very strong positive linear relationship between sale and list.

Neighborhood also appeared to have a strong relationship with sale price. Houses in higher-ranked neighborhoods generally sold for higher prices than houses in lower-ranked neighborhoods.

The relationships between sale price and the numbers of bathrooms, bedrooms, and rooms were weaker.
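The visual impressions above could be backed up with correlation coefficients. The following is a minimal sketch, not part of the original analysis: cor_with_response is a hypothetical helper name, and the call assumes homeprice.csv is in the working directory. Because neighborhood is read in as an integer, it is treated as a numeric rank here.

```r
# Hypothetical helper: correlation of every numeric column with a response,
# sorted by absolute strength (strongest first).
cor_with_response <- function(df, response) {
  num <- df[sapply(df, is.numeric)]
  predictors <- setdiff(names(num), response)
  r <- sapply(predictors, function(v) cor(num[[v]], num[[response]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Runs only if the data file is available, as assumed in this report.
if (file.exists("homeprice.csv")) {
  homeprice_raw <- read.csv("homeprice.csv")
  print(round(cor_with_response(homeprice_raw, "sale"), 3))
}
```

Pearson correlation only captures linear association, so it complements the scatterplots rather than replacing them.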

Multiple Linear Regression for Sale Price

A multiple linear regression model was constructed using the available house characteristics.

homeprice$neighborhood <- factor(homeprice$neighborhood)

m_sale <- lm(
  sale ~ list + full + half + bedrooms + rooms + neighborhood,
  data = homeprice
)

summary(m_sale)

Call:
lm(formula = sale ~ list + full + half + bedrooms + rooms + neighborhood, 
    data = homeprice)

Residuals:
    Min      1Q  Median      3Q     Max 
-24.149  -6.679   1.486   4.364  24.149 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    27.96941   15.84528   1.765   0.0936 .  
list            0.91554    0.07965  11.494 5.34e-10 ***
full           -0.04073    5.49422  -0.007   0.9942    
half            5.07142    6.08110   0.834   0.4147    
bedrooms       -4.39758    6.64979  -0.661   0.5164    
rooms           2.44383    3.66230   0.667   0.5126    
neighborhood2 -26.11476   12.69888  -2.056   0.0537 .  
neighborhood3 -11.42126   14.32365  -0.797   0.4351    
neighborhood4  -4.41077   20.66629  -0.213   0.8333    
neighborhood5  -4.39556   30.46718  -0.144   0.8868    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.86 on 19 degrees of freedom
Multiple R-squared:  0.9919,    Adjusted R-squared:  0.988 
F-statistic: 257.1 on 9 and 19 DF,  p-value: < 2.2e-16

ANOVA

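The ANOVA table was used to assess how much of the variation in sale price each variable accounts for. Because anova() reports sequential sums of squares, each row reflects what a term adds after the terms listed above it.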

anova(m_sale)
Analysis of Variance Table

Response: sale
             Df Sum Sq Mean Sq   F value Pr(>F)    
list          1 381050  381050 2305.6632 <2e-16 ***
full          1    156     156    0.9443 0.3434    
half          1     21      21    0.1271 0.7254    
bedrooms      1     25      25    0.1529 0.7001    
rooms         1      3       3    0.0164 0.8994    
neighborhood  4   1107     277    1.6748 0.1973    
Residuals    19   3140     165                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
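When reading this table, it helps to remember that anova() on an lm fit uses sequential (Type I) sums of squares, so the split between terms depends on the order of the formula. The sketch below illustrates this on simulated data; the data frame sim and all of its numbers are invented for illustration and are not the homeprice data.

```r
# Simulated data for illustration only; NOT the homeprice data.
set.seed(1)
n <- 29
sim <- data.frame(neighborhood = factor(rep(1:5, length.out = n)))
# List price rises with neighborhood rank; sale price tracks list price.
sim$list <- 100 + 50 * as.integer(sim$neighborhood) + rnorm(n, sd = 20)
sim$sale <- 0.95 * sim$list + rnorm(n, sd = 10)

fit_list_first <- lm(sale ~ list + neighborhood, data = sim)
fit_nbhd_first <- lm(sale ~ neighborhood + list, data = sim)

# Extract the sequential sums of squares for the two terms.
seq_ss <- function(fit) {
  setNames(anova(fit)[c("list", "neighborhood"), "Sum Sq"],
           c("list", "neighborhood"))
}
ss_list_first <- seq_ss(fit_list_first)
ss_nbhd_first <- seq_ss(fit_nbhd_first)

# Same fitted model, but the explained variation is credited differently.
print(rbind(list_first = ss_list_first, nbhd_first = ss_nbhd_first))

# drop1() tests each term given all the others, so order does not matter.
print(drop1(fit_list_first, test = "F"))
```

For the model in this report, drop1(m_sale, test = "F") would give order-independent tests for each term.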

Residual Diagnostics

Residual plots were examined to evaluate model assumptions.

plot(m_sale)

Interpretation

The model explained approximately 99% of the variation in sale price (R² ≈ 0.99).
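As a check, R² can be recovered from the Sum Sq column of the ANOVA table above. The sketch below copies the printed values by hand, so the result only agrees with summary(m_sale) up to rounding.

```r
# Sum Sq column of anova(m_sale), copied from the table in this report.
ss <- c(list = 381050, full = 156, half = 21, bedrooms = 25,
        rooms = 3, neighborhood = 1107, Residuals = 3140)

total_ss <- sum(ss)

# R-squared = 1 - residual SS / total SS.
r_squared <- 1 - ss[["Residuals"]] / total_ss

# Percent of the total variation credited to each row (sequential SS).
share <- round(100 * ss / total_ss, 2)

print(round(r_squared, 4))
print(share)
```

The shares show how the explained variation is divided among the terms in the order they were entered.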

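The coefficient for list price was highly significant (estimate = 0.916, p = 5.34e-10): holding the other variables fixed, each one-unit increase in list price is associated with an increase of about 0.92 units in sale price.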
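The size of the list coefficient can be made more concrete with a confidence interval. This is a rough sketch that uses only the estimate, standard error, and residual degrees of freedom printed by summary(m_sale); with the fitted object available, confint(m_sale, "list") would give the same interval directly.

```r
# Values copied from the summary(m_sale) output in this report.
est <- 0.91554      # estimate for list
se <- 0.07965       # standard error for list
df_resid <- 19      # residual degrees of freedom

# t statistic for H0: coefficient = 0.
t_value <- est / se

# 95% confidence interval based on the t distribution.
ci <- est + c(-1, 1) * qt(0.975, df_resid) * se

print(round(t_value, 3))
print(round(ci, 3))
```

If the interval contains 1, the data are consistent with sale price moving one-for-one with list price once the other variables are held fixed.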

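According to the ANOVA results, list price accounted for by far the largest share of the variation in sale price (Sum Sq = 381050, F = 2305.7, p < 2e-16). Neighborhood, entered last, added comparatively little (Sum Sq = 1107, F = 1.67, p = 0.1973).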

The residual plots indicated some violations of model assumptions. The Normal Q-Q and Residuals vs Leverage plots flagged observations 12, 14, and 19 as potential outliers and influential points.
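Points flagged visually can also be listed numerically. The following is a minimal sketch: flag_influential is a hypothetical helper, and its cutoffs (Cook's distance above 4/n, absolute standardized residual above 2) are common rules of thumb assumed here, not thresholds used in this report. It is demonstrated on R's built-in cars data so that it runs without homeprice.csv.

```r
# Hypothetical helper: list observations that exceed rule-of-thumb cutoffs.
flag_influential <- function(fit) {
  cooks <- cooks.distance(fit)   # influence of each observation on the fit
  std_res <- rstandard(fit)      # standardized residuals
  out <- data.frame(obs = names(cooks), cooks = cooks, std_res = std_res)
  # which() drops any NA/NaN comparisons instead of propagating them.
  keep <- which(cooks > 4 / nobs(fit) | abs(std_res) > 2)
  out[keep, ]
}

# Demonstration on built-in data; for this report the call would be
# flag_influential(m_sale).
print(flag_influential(lm(dist ~ speed, data = cars)))
```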

Multiple Linear Regression for List Price

A second model was created to explain list price, using the same house characteristics with sale price replacing list price as a predictor.

m_list <- lm(
  list ~ sale + full + half + bedrooms + rooms + neighborhood,
  data = homeprice
)

summary(m_list)

Call:
lm(formula = list ~ sale + full + half + bedrooms + rooms + neighborhood, 
    data = homeprice)

Residuals:
     Min       1Q   Median       3Q      Max 
-24.7804  -6.5758   0.6545   6.2554  24.7804 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -28.00292   16.23412  -1.725   0.1008    
sale            0.95493    0.08308  11.494 5.34e-10 ***
full            3.72018    5.54588   0.671   0.5104    
half            1.45482    6.31436   0.230   0.8202    
bedrooms        5.51946    6.75132   0.818   0.4238    
rooms          -1.17382    3.77423  -0.311   0.7592    
neighborhood2  28.10130   12.80917   2.194   0.0409 *  
neighborhood3  23.60418   13.85028   1.704   0.1046    
neighborhood4  27.96652   20.13387   1.389   0.1809    
neighborhood5  42.00318   29.60384   1.419   0.1721    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.13 on 19 degrees of freedom
Multiple R-squared:  0.9919,    Adjusted R-squared:  0.9881 
F-statistic: 259.6 on 9 and 19 DF,  p-value: < 2.2e-16

ANOVA

anova(m_list)
Analysis of Variance Table

Response: list
             Df Sum Sq Mean Sq   F value Pr(>F)    
sale          1 401374  401374 2328.4670 <2e-16 ***
full          1    346     346    2.0059 0.1729    
half          1    134     134    0.7749 0.3897    
bedrooms      1      4       4    0.0217 0.8843    
rooms         1     24      24    0.1382 0.7142    
neighborhood  4    908     227    1.3164 0.2997    
Residuals    19   3275     172                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation

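According to the ANOVA results, sale price accounted for by far the largest share of the variation in list price (F = 2328.5, p < 2e-16). The neighborhood term was not significant after the other variables were included (F = 1.32, p = 0.2997).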

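In the sale price model, list is the strongest individual predictor in summary() and also accounts for nearly all of the explained variation in anova(); the neighborhood term as a whole is not significant (p = 0.1973). In the list price model, sale plays the same dominant role, and neighborhood2 is the only neighborhood level with a coefficient significant at the 5% level (p = 0.0409), although the neighborhood term as a whole is not significant (p = 0.2997).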

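Neighborhood rank may still be worth attention for real estate agents, since the exploratory boxplot showed higher sale prices in higher-ranked neighborhoods, but in these models its contribution is largely absorbed by the very strong relationship between sale price and list price.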

Difference Between Sale Price and List Price

To investigate whether neighborhood affects the difference between sale price and list price, a new variable was created.

homeprice$diff <- homeprice$sale - homeprice$list

summary(homeprice$diff)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-38.000  -6.000   0.000  -1.324   2.000  37.700 

A boxplot was used to compare the differences among neighborhoods.

boxplot(diff ~ neighborhood,
        data = homeprice,
        xlab = "Neighborhood",
        ylab = "Sale Price - List Price")

A linear model with neighborhood as the only predictor (equivalent to a one-way ANOVA) was fitted.

m_diff <- lm(diff ~ neighborhood, data = homeprice)

summary(m_diff)

Call:
lm(formula = diff ~ neighborhood, data = homeprice)

Residuals:
    Min      1Q  Median      3Q     Max 
-26.000  -4.875   0.150   5.125  26.000 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)     18.650      8.626   2.162   0.0408 *
neighborhood2  -24.800      9.644  -2.572   0.0167 *
neighborhood3  -17.775      9.317  -1.908   0.0684 .
neighborhood4  -21.250     10.206  -2.082   0.0482 *
neighborhood5  -30.650     12.199  -2.513   0.0191 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.2 on 24 degrees of freedom
Multiple R-squared:  0.2636,    Adjusted R-squared:  0.1409 
F-statistic: 2.148 on 4 and 24 DF,  p-value: 0.1059

anova(m_diff)
Analysis of Variance Table

Response: diff
             Df Sum Sq Mean Sq F value Pr(>F)
neighborhood  4 1278.4  319.59  2.1477 0.1059
Residuals    24 3571.4  148.81               

Interpretation

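Overall, neighborhood does not show strong evidence of affecting the difference between sale price and list price: the F-test for the neighborhood term is not significant at the 5% level (F = 2.148 on 4 and 24 DF, p = 0.1059), and neighborhood explains only about 26% of the variation in the difference (R² = 0.2636). Several individual coefficients have p-values below 0.05, but these only compare each neighborhood with neighborhood 1, the baseline level.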
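The overall test for neighborhood can be reproduced from the anova(m_diff) table. The sketch below copies the printed sums of squares and degrees of freedom by hand, so the results match the output above only up to rounding.

```r
# Values copied from the anova(m_diff) table in this report.
ss_nbhd <- 1278.4    # Sum Sq for neighborhood
ss_resid <- 3571.4   # Sum Sq for residuals
df_nbhd <- 4
df_resid <- 24

# F statistic = mean square for neighborhood / residual mean square.
f_stat <- (ss_nbhd / df_nbhd) / (ss_resid / df_resid)

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, df_nbhd, df_resid, lower.tail = FALSE)

# In a one-factor model, R-squared is the factor's share of the total SS.
r_squared <- ss_nbhd / (ss_nbhd + ss_resid)

print(round(c(F = f_stat, p = p_value, R2 = r_squared), 4))
```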

The average difference between sale price and list price was not larger in richer neighborhoods. In fact, based on the fitted coefficients, houses in neighborhood 5 (the highest rank) sold on average about 12 below the asking price, while houses in neighborhood 1 (the lowest rank) sold on average about 18.65 above it. The averages for neighborhoods 2 to 4 fell between about 6 below and 1 above the asking price.

Therefore, richer neighborhoods do not appear to be more likely to have houses sell above the asking price.
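The average difference in each neighborhood can be read off the m_diff coefficients. With R's default treatment contrasts, the intercept is the mean for neighborhood 1 and each other coefficient is an offset from it. The sketch below uses the printed estimates; with the data loaded, tapply(homeprice$diff, homeprice$neighborhood, mean) should give the same values.

```r
# Estimates copied from the summary(m_diff) output in this report.
coefs <- c("(Intercept)" = 18.650,
           neighborhood2 = -24.800,
           neighborhood3 = -17.775,
           neighborhood4 = -21.250,
           neighborhood5 = -30.650)

# Baseline mean, then baseline plus each offset.
baseline <- coefs[["(Intercept)"]]
group_means <- c(baseline, baseline + unname(coefs[-1]))
names(group_means) <- paste0("neighborhood", 1:5)

# Positive values mean houses sold above list price on average.
print(group_means)
```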

Conclusion

This analysis examined the relationship between housing characteristics and both sale price and list price.

The strongest relationship with sale price was observed for list price. Multiple linear regression indicated that the model explained approximately 99% of the variation in sale price.

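Both models fit very well (R² ≈ 0.99), largely because list and sale prices are so closely related to each other. Once that relationship is accounted for, the numbers of bathrooms, bedrooms, and rooms are not significant, and neighborhood adds little: only neighborhood2 reaches significance in the list price model (p = 0.0409), while the neighborhood term as a whole is not significant in either ANOVA table (p = 0.1973 and p = 0.2997). Neighborhood rank appears related to price in the exploratory boxplot, but these models do not show that it matters beyond what is already reflected in the list or sale price.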

Finally, richer neighborhoods were not more likely to have houses sell above the asking price. The observed differences between sale price and list price were relatively small and did not increase consistently with neighborhood rank.