Buttin Final Report

Author

Camille Buttin

Home Prices

Hello, and welcome to my final project report. I will be examining the exciting world of home prices. Buckle in, it’s going to be a doozy.

First Steps

First, let’s load in the data:

homeprice <- read.csv("homeprice.csv")

Now, let’s get an idea of what we are working with:

str(homeprice)
'data.frame':   29 obs. of  7 variables:
 $ list        : num  80 151 310 295 339 ...
 $ sale        : num  118 151 300 275 340 ...
 $ full        : int  1 1 2 2 2 1 3 1 1 1 ...
 $ half        : int  0 0 1 1 0 1 0 1 2 0 ...
 $ bedrooms    : int  3 4 4 4 3 4 3 3 3 1 ...
 $ rooms       : int  6 7 9 8 7 8 7 7 7 3 ...
 $ neighborhood: int  1 1 3 3 4 3 2 2 3 2 ...
names(homeprice)
[1] "list"         "sale"         "full"         "half"         "bedrooms"    
[6] "rooms"        "neighborhood"

Exploring Our Variables

We can now see the key variables of this dataset. To see which variables are most closely related to Sales Price (sale), let’s compute a correlation matrix using only the numeric variables (here, that turns out to be all seven):

numeric_vars <- homeprice[, sapply(homeprice, is.numeric)]
cor_matrix <- cor(numeric_vars)
round(cor_matrix, 2)
             list sale  full  half bedrooms rooms neighborhood
list         1.00 0.99  0.65  0.39     0.48  0.63         0.88
sale         0.99 1.00  0.63  0.39     0.49  0.63         0.88
full         0.65 0.63  1.00 -0.12     0.32  0.40         0.62
half         0.39 0.39 -0.12  1.00     0.25  0.35         0.16
bedrooms     0.48 0.49  0.32  0.25     1.00  0.85         0.24
rooms        0.63 0.63  0.40  0.35     0.85  1.00         0.41
neighborhood 0.88 0.88  0.62  0.16     0.24  0.41         1.00
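
Each entry in a matrix like this is a Pearson correlation: the covariance of two variables divided by the product of their standard deviations. As a rough sketch, the snippet below recomputes one such value by hand, using only the first ten values of full and rooms printed in the str() output earlier. Because it uses a ten-row subset, the result is not expected to match the full-data value of 0.40. The names full_10 and rooms_10 are made up for this illustration.

```r
# First ten values of `full` and `rooms`, copied from the str() output
# (an illustrative subset, not the full 29 rows)
full_10  <- c(1, 1, 2, 2, 2, 1, 3, 1, 1, 1)
rooms_10 <- c(6, 7, 9, 8, 7, 8, 7, 7, 7, 3)

# Pearson correlation by hand: covariance / (sd_x * sd_y)
r_manual <- cov(full_10, rooms_10) / (sd(full_10) * sd(rooms_10))

# The built-in function should agree with the manual calculation
r_builtin <- cor(full_10, rooms_10)

all.equal(r_manual, r_builtin)
```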

Now, let’s isolate the variables most closely correlated with sale:

sale_corr <- sort(cor_matrix["sale", ], decreasing = TRUE)
sale_corr
        sale         list neighborhood        rooms         full     bedrooms 
   1.0000000    0.9942086    0.8770245    0.6283765    0.6271649    0.4864766 
        half 
   0.3941621 

Based on these results, we can see that the variables most closely correlated with sale are list (listing price), neighborhood (neighborhood rank), rooms (the number of non-bedrooms), and full (the number of full bathrooms).

Let’s examine these relationships more closely with some plots. Wahoo!

Scatterplot: Sales Price x Listing Price

# Scatterplot: Sales vs List Price
library(ggplot2)
ggplot(homeprice, aes(x = list, y = sale)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Sales Price vs. Listing Price",
    x = "Listing Price (in $1000s)",
    y = "Sales Price (in $1000s)"
  )
`geom_smooth()` using formula = 'y ~ x'

As we can see from the line of best fit, there is a nearly one-to-one relationship between Sales Price and Listing Price, which matches the correlation of 0.99 we found earlier. Incredible! Let’s look at the other variables (and maybe change up our plots):

Boxplot(s): Sales Price x Neighborhood Rank

ggplot(homeprice, aes(x = as.factor(neighborhood), y = sale)) +
  geom_boxplot(fill = "pink") +
  labs(
    title = "Sales Price by Neighborhood Rank",
    x = "Neighborhood Rank (1 = Poor, 5 = Rich)",
    y = "Sales Price (in $1000s)"
  )

Again, we have a pretty close (but not perfect) relationship. What can we discover about sale and rooms?

Scatterplot: Sales Price x Number of Non-Bedrooms

ggplot(homeprice, aes(x = rooms, y = sale)) +
  geom_jitter(width = 0.3, height = 0, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(
    title = "Sales Price vs. Number of Non-Bedroom Rooms",
    x = "Number of Non-Bedroom Rooms",
    y = "Sales Price (in $1000s)"
  )
`geom_smooth()` using formula = 'y ~ x'

Here, we can still see a positive correlation, but this one is only moderately strong.

Modeling Sales Price

Now that we’ve explored which variables are most strongly correlated with sales price, let’s see how well they jointly explain variation in sales price. To do this, we will use a multiple linear regression model. We’ll include list, neighborhood, rooms, and full as predictors. After fitting the model, we’ll interpret the coefficients, evaluate goodness-of-fit, and use ANOVA to see which variable has the largest effect.

model_sale <- lm(sale ~ list + neighborhood + rooms + full, data = homeprice)
summary(model_sale)

Call:
lm(formula = sale ~ list + neighborhood + rooms + full, data = homeprice)

Residuals:
    Min      1Q  Median      3Q     Max 
-27.970  -6.933  -0.429   4.924  33.842 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.96808   15.73481   0.316    0.755    
list          0.96659    0.05764  16.771 9.34e-15 ***
neighborhood  1.92858    5.78931   0.333    0.742    
rooms         0.65823    2.32324   0.283    0.779    
full         -4.31538    4.44447  -0.971    0.341    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.34 on 24 degrees of freedom
Multiple R-squared:  0.9889,    Adjusted R-squared:  0.9871 
F-statistic: 535.5 on 4 and 24 DF,  p-value: < 2.2e-16

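From the output above, we can see that listing price is by far the strongest predictor of sale price, with a highly significant p-value (< 0.001) and a coefficient close to 1. The other predictors (neighborhood, rooms, and full baths) are not statistically significant in this model. Neighborhood was strongly correlated with sale on its own (0.88), but it is just as strongly correlated with list (0.88), so once list is in the model there is little left for it to explain. The R-squared value is 0.989, meaning the model explains nearly 99% of the variation in sale price, indicating an excellent fit overall. This aligns with our correlation analysis, where list had by far the highest correlation with sale (0.99)!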
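
To make the coefficients concrete, here is a small worked example that plugs one home into the fitted equation by hand. The coefficient values are copied from the summary output above. The home’s characteristics (listed at 80, neighborhood 1, 6 rooms, 1 full bath) match the first entries shown in the str() output, and the vector names b and home are just illustrative.

```r
# Coefficients copied from the summary(model_sale) output above
b <- c(intercept = 4.96808, list = 0.96659, neighborhood = 1.92858,
       rooms = 0.65823, full = -4.31538)

# One home: listed at $80k, neighborhood rank 1, 6 rooms, 1 full bath
# (the leading 1 multiplies the intercept)
home <- c(intercept = 1, list = 80, neighborhood = 1, rooms = 6, full = 1)

# Predicted sale price (in $1000s) is the sum of coefficient * value
pred_sale <- sum(b * home)
pred_sale
```

This comes out to roughly 83.9 (in $1000s). With the fitted model in memory, predict(model_sale, newdata = data.frame(list = 80, neighborhood = 1, rooms = 6, full = 1)) should give the same answer up to rounding of the printed coefficients.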

Now, let’s run an ANOVA:

anova(model_sale)
Analysis of Variance Table

Response: sale
             Df Sum Sq Mean Sq   F value Pr(>F)    
list          1 381050  381050 2140.9284 <2e-16 ***
neighborhood  1      2       2    0.0121 0.9134    
rooms         1     11      11    0.0601 0.8085    
full          1    168     168    0.9428 0.3413    
Residuals    24   4272     178                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

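The table above tells us how much variation in sale price each variable explains. Keep in mind that anova() reports sequential sums of squares, so each row measures what a variable adds after the ones listed above it. Here, the results align with our previous output: list accounts for almost all of the explained variation (Sum Sq = 381,050), while neighborhood (2), rooms (11), and full (168) add almost nothing once list is in the model, and none of the three is statistically significant.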
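
To put numbers on how much each row of the ANOVA table contributes, the sketch below turns the sums of squares into percentages of the total. The values are copied from the table above, and the names ss and share_pct are made up for illustration. Since the table is sequential, each share is what a variable adds after the ones listed before it.

```r
# Sequential sums of squares copied from the anova(model_sale) table
ss <- c(list = 381050, neighborhood = 2, rooms = 11, full = 168,
        Residuals = 4272)

# Share of total variation attributed to each row, in percent
share_pct <- round(100 * ss / sum(ss), 2)
share_pct

# R-squared is everything except the residual share
r_squared <- 1 - ss[["Residuals"]] / sum(ss)
round(r_squared, 4)
```

The list row alone covers a little under 99% of the total, and the recomputed R-squared agrees with the 0.9889 reported by summary(model_sale).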

Residual Analysis

Now, let’s take a look at the distribution of the residuals. Ideally, residuals should be approximately normally distributed and centered around 0. This supports the assumptions of linear regression. To get a better idea of their distribution, let’s extract and plot them:

residuals_sale <- residuals(model_sale)

hist(residuals_sale,
     breaks = 20,
     col = "lightblue",
     freq = FALSE,  # Use density instead of counts
     main = "Histogram of Residuals with Normal Curve",
     xlab = "Residuals")

# Add normal curve
curve(dnorm(x, mean = mean(residuals_sale), sd = sd(residuals_sale)),
      col = "red",
      lwd = 2,
      add = TRUE)

As we can see, the residuals appear more or less normally distributed and centered around 0, which is consistent with the model assumptions. With only 29 observations, though, a histogram is just a rough check.
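
Two other common ways to check residual normality are a normal Q-Q plot and the Shapiro-Wilk test. The sketch below shows both on simulated stand-in data (not the homeprice data), so that it runs on its own. The names toy, toy_model, and toy_resid are hypothetical, and the simulated slope and noise level are arbitrary choices.

```r
# Simulated stand-in data (NOT the homeprice data), for illustration only
set.seed(1)
toy <- data.frame(x = runif(29, 50, 400))
toy$y <- 5 + 0.97 * toy$x + rnorm(29, sd = 13)

toy_model <- lm(y ~ x, data = toy)
toy_resid <- residuals(toy_model)

# Normal Q-Q plot: points close to the reference line suggest
# approximately normal residuals
qqnorm(toy_resid, main = "Normal Q-Q Plot of Residuals")
qqline(toy_resid, col = "red", lwd = 2)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(toy_resid)
sw$p.value
```

To run the same checks on the real model, swap toy_resid for residuals_sale.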

Explaining List Price

Now, let’s apply the same modeling approach to the listing price instead of the sale price. We’ll use the same predictors: “neighborhood,” “rooms,” and “full” (but we will exclude “list,” since it’s now the outcome):

# Model for list price
model_list <- lm(list ~ neighborhood + rooms + full, data = homeprice)
summary(model_list)

Call:
lm(formula = list ~ neighborhood + rooms + full, data = homeprice)

Residuals:
    Min      1Q  Median      3Q     Max 
-94.636 -26.574  -4.456  30.781  71.364 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -166.507     43.270  -3.848 0.000731 ***
neighborhood   83.056     11.298   7.351 1.06e-07 ***
rooms          24.299      6.432   3.778 0.000875 ***
full           14.882     15.133   0.983 0.334832    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 46.29 on 25 degrees of freedom
Multiple R-squared:  0.868, Adjusted R-squared:  0.8522 
F-statistic: 54.82 on 3 and 25 DF,  p-value: 3.894e-11

In this model, both neighborhood rank and number of rooms are statistically significant predictors (p < 0.001). Neighborhood has the strongest effect. Full bathrooms are not significant. The model explains about 87% of the variation in listing price (R² = 0.868), suggesting a strong overall fit.

Let’s examine the ANOVA table for this model to see if we can learn more:

anova(model_list)
Analysis of Variance Table

Response: list
             Df Sum Sq Mean Sq  F value    Pr(>F)    
neighborhood  1 315177  315177 147.0585 5.732e-12 ***
rooms         1  35233   35233  16.4396 0.0004306 ***
full          1   2073    2073   0.9671 0.3348317    
Residuals    25  53580    2143                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see, the variables of neighborhood and rooms seem to have strong impacts on list price. So, we can reasonably guess that a real estate agent might focus more on neighborhood rank when setting the list price, while listing price itself will be the biggest predictor of actual sale price. Fascinating stuff!
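
As a sanity check on how these outputs fit together, the sums of squares in the ANOVA table are enough to rebuild the fit statistics that summary(model_list) reported. The sketch below does that arithmetic with values copied from the table. The variable names are made up for illustration, and small differences from the printed output come from rounding in the table.

```r
# Sums of squares and degrees of freedom from the anova(model_list) table
ss_terms <- c(neighborhood = 315177, rooms = 35233, full = 2073)
ss_resid <- 53580
df_model <- 3
df_resid <- 25

ss_model <- sum(ss_terms)
ss_total <- ss_model + ss_resid

# R-squared: explained variation as a share of total variation
r_squared <- ss_model / ss_total

# Adjusted R-squared: compares mean squares instead of raw sums
adj_r_squared <- 1 - (ss_resid / df_resid) / (ss_total / (df_model + df_resid))

# Overall F-statistic: model mean square over residual mean square
f_stat <- (ss_model / df_model) / (ss_resid / df_resid)

round(c(r_squared = r_squared, adj_r_squared = adj_r_squared,
        f_stat = f_stat), 4)
```

The results land close to the reported values of 0.868, 0.8522, and 54.82.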

Neighborhood effects on Sales vs. Listing Price

Finally, to explore how neighborhood rank influences whether homes sell above or below their asking price, we’ll calculate the difference between sale and list price for each home and compare those differences across neighborhoods.

Let’s start by creating a new variable which represents the difference between “sale” and “list”:

homeprice$diff <- homeprice$sale - homeprice$list

Now, let’s examine this difference with a box plot.

Boxplot of difference by neighborhood

ggplot(homeprice, aes(x = as.factor(neighborhood), y = diff)) +
  geom_boxplot(fill = "lightgreen") +
  labs(
    title = "Sales Price Minus Listing Price by Neighborhood",
    x = "Neighborhood Rank (1 = Poor, 5 = Rich)",
    y = "Sale - List Price (in $1000s)"
  )

Et voilà! Interestingly, from the boxplot, we can see that homes in poorer neighborhoods (rank 1) tend to sell well above their listing price, with a median difference of over 20,000 dollars. This could suggest that sellers are underpricing homes…or that these areas experience higher-than-expected demand.

In contrast, homes in the wealthiest neighborhoods (rank 5) often sell below their listing price, with a median shortfall around 15,000–20,000 dollars. This may reflect overambitious pricing or less competitive pressure in high-end markets. Neighborhoods ranked 2–4 show differences closer to zero, suggesting more balanced or accurate pricing overall.
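
Reading medians off a boxplot is approximate, so it can help to compute them directly. Here is a minimal sketch using aggregate() on a made-up toy data frame (not the homeprice data), so the numbers it prints say nothing about the real neighborhoods. The name diff_summary is hypothetical.

```r
# Made-up toy values (NOT the homeprice data), for illustration only
toy <- data.frame(
  neighborhood = c(1, 1, 1, 3, 3, 3, 5, 5, 5),
  list         = c(80, 95, 110, 250, 270, 300, 500, 550, 600),
  sale         = c(90, 100, 112, 251, 268, 300, 490, 535, 580)
)
toy$diff <- toy$sale - toy$list

# Median of sale - list within each neighborhood rank
diff_summary <- aggregate(diff ~ neighborhood, data = toy, FUN = median)
diff_summary
```

On the real data, the equivalent call would be aggregate(diff ~ neighborhood, data = homeprice, FUN = median), run after the diff column has been created.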

Thank you for taking the time to read my final report. Ta ta!