Buttin Final Report
Home Prices
Hello, and welcome to my final project report. I will be examining the exciting world of home prices. Buckle in, it’s going to be a doozy.
First Steps
First, let’s load in the data:
homeprice <- read.csv("homeprice.csv")
Now, let’s get an idea of what we are working with:
str(homeprice)
'data.frame': 29 obs. of 7 variables:
$ list : num 80 151 310 295 339 ...
$ sale : num 118 151 300 275 340 ...
$ full : int 1 1 2 2 2 1 3 1 1 1 ...
$ half : int 0 0 1 1 0 1 0 1 2 0 ...
$ bedrooms : int 3 4 4 4 3 4 3 3 3 1 ...
$ rooms : int 6 7 9 8 7 8 7 7 7 3 ...
$ neighborhood: int 1 1 3 3 4 3 2 2 3 2 ...
names(homeprice)
[1] "list" "sale" "full" "half" "bedrooms"
[6] "rooms" "neighborhood"
Exploring Our Variables
We can now see the key variables of this dataset. To see which variables are most closely related to Sales Price (sale), let’s run a correlation matrix, only using numeric variables:
numeric_vars <- homeprice[, sapply(homeprice, is.numeric)]
cor_matrix <- cor(numeric_vars)
round(cor_matrix, 2)
             list sale  full  half bedrooms rooms neighborhood
list         1.00 0.99  0.65  0.39     0.48  0.63         0.88
sale         0.99 1.00  0.63  0.39     0.49  0.63         0.88
full         0.65 0.63  1.00 -0.12     0.32  0.40         0.62
half         0.39 0.39 -0.12  1.00     0.25  0.35         0.16
bedrooms     0.48 0.49  0.32  0.25     1.00  0.85         0.24
rooms        0.63 0.63  0.40  0.35     0.85  1.00         0.41
neighborhood 0.88 0.88  0.62  0.16     0.24  0.41         1.00
Now, let’s isolate the variables most closely correlated with sale:
sale_corr <- sort(cor_matrix["sale", ], decreasing = TRUE)
sale_corr
        sale         list neighborhood        rooms         full     bedrooms
   1.0000000    0.9942086    0.8770245    0.6283765    0.6271649    0.4864766
        half
   0.3941621
Based on these results, we can see that the variables most closely correlated with sale are list (listing price), neighborhood (neighborhood rank), rooms (the number of non-bedrooms), and full (the number of full bathrooms).
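To make the selection step explicit, here is a minimal sketch that drops the trivial self-correlation and keeps the predictors above a cutoff. The correlations are typed in from the sale_corr output above so the snippet runs on its own; the cutoff of 0.6 and the names predictors, cutoff, and top_vars are assumptions made for illustration, not something the analysis itself fixes.

```r
# Correlations with sale, copied by hand from the sale_corr output above
sale_corr <- c(sale = 1.0000000, list = 0.9942086, neighborhood = 0.8770245,
               rooms = 0.6283765, full = 0.6271649, bedrooms = 0.4864766,
               half = 0.3941621)

# Drop the correlation of sale with itself, which is always 1
predictors <- sale_corr[names(sale_corr) != "sale"]

# Keep predictors above an assumed cutoff (0.6 is an illustrative choice)
cutoff <- 0.6
top_vars <- names(predictors[predictors > cutoff])
top_vars  # "list" "neighborhood" "rooms" "full"
```

With the full dataset loaded, the same filter could be applied directly to the sale_corr object computed earlier.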
Let’s examine these relationships more closely with some plots. Wahoo!
Scatterplot: Sales Price x Listing Price
# Scatterplot: Sales vs List Price
library(ggplot2)
ggplot(homeprice, aes(x = list, y = sale)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Sales Price vs. Listing Price",
    x = "Listing Price (in $1000s)",
    y = "Sales Price (in $1000s)"
  )
`geom_smooth()` using formula = 'y ~ x'
As we can see from the line of best fit, sale price tracks listing price almost one-to-one (r = 0.99). Incredible! Let’s look at the other variables (and maybe change up our plots):
Boxplot(s): Sales Price x Neighborhood Rank
ggplot(homeprice, aes(x = as.factor(neighborhood), y = sale)) +
  geom_boxplot(fill = "pink") +
  labs(
    title = "Sales Price by Neighborhood Rank",
    x = "Neighborhood Rank (1 = Poor, 5 = Rich)",
    y = "Sales Price (in $1000s)"
  )
Again, we have a pretty close (but not perfect) relationship: sale price generally rises with neighborhood rank. What can we discover about sale and rooms?
Scatterplot: Sales Price x Number of Non-Bedrooms
ggplot(homeprice, aes(x = rooms, y = sale)) +
  geom_jitter(width = 0.3, height = 0, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(
    title = "Sales Price vs. Number of Non-Bedroom Rooms",
    x = "Number of Non-Bedroom Rooms",
    y = "Sales Price (in $1000s)"
  )
`geom_smooth()` using formula = 'y ~ x'
Here, we can still see a positive correlation, but this one is only moderately strong.
Modeling Sales Price
Now that we’ve explored which variables are most strongly correlated with sales price, let’s see how well they jointly explain variation in sales price. To do this, we will use a multiple linear regression model. We’ll include list, neighborhood, rooms, and full as predictors. After fitting the model, we’ll interpret the coefficients, evaluate goodness-of-fit, and use ANOVA to see which variable has the largest effect.
model_sale <- lm(sale ~ list + neighborhood + rooms + full, data = homeprice)
summary(model_sale)
Call:
lm(formula = sale ~ list + neighborhood + rooms + full, data = homeprice)
Residuals:
    Min      1Q  Median      3Q     Max
-27.970  -6.933  -0.429   4.924  33.842
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.96808   15.73481   0.316    0.755
list          0.96659    0.05764  16.771 9.34e-15 ***
neighborhood  1.92858    5.78931   0.333    0.742
rooms         0.65823    2.32324   0.283    0.779
full         -4.31538    4.44447  -0.971    0.341
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.34 on 24 degrees of freedom
Multiple R-squared: 0.9889, Adjusted R-squared: 0.9871
F-statistic: 535.5 on 4 and 24 DF, p-value: < 2.2e-16
From the output above, we can see that listing price is by far the strongest predictor of sale price, with a highly significant p-value (< 0.001) and a coefficient close to 1. The other predictors (neighborhood, rooms, and full baths) are not statistically significant in this model. The R-squared value is 0.989, meaning the model explains nearly 99% of the variation in sale price, indicating an excellent fit overall. This aligns with our correlation analysis, where list had by far the highest correlation with sale; the other predictors likely add little here because the listing price already reflects them.
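One way to check the "coefficient close to 1" reading is a 95% confidence interval for the list coefficient. The sketch below rebuilds it by hand from the estimate, standard error, and residual degrees of freedom printed in the summary, so it runs without the data file; the variable names est, se, df_resid, and ci are just illustrative. On the fitted model itself, confint(model_sale) should give the same interval.

```r
# Values copied from the summary(model_sale) output above
est <- 0.96659      # estimate for list
se <- 0.05764       # its standard error
df_resid <- 24      # residual degrees of freedom

# 95% interval: estimate plus or minus the t critical value times the SE
t_crit <- qt(0.975, df_resid)
ci <- est + c(-1, 1) * t_crit * se
round(ci, 2)  # 0.85 1.09
```

Since the interval contains 1, the data are consistent with homes selling, on average, at a constant proportion very near their listing price.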
Now, let’s run an ANOVA:
anova(model_sale)
Analysis of Variance Table
Response: sale
             Df Sum Sq Mean Sq   F value Pr(>F)
list          1 381050  381050 2140.9284 <2e-16 ***
neighborhood  1      2       2    0.0121 0.9134
rooms         1     11      11    0.0601 0.8085
full          1    168     168    0.9428 0.3413
Residuals    24   4272     178
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
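The table above shows how much variation in sale price each variable accounts for. Because anova() reports sequential sums of squares, each row measures what a variable adds after the variables listed above it. Here, the results align with our previous output: list comes in at #1 by a huge margin (Sum Sq of 381,050), while neighborhood, rooms, and full add almost nothing once list is in the model (Sum Sq of 2, 11, and 168, none of them significant).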
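The ANOVA table and the R-squared values from the model summary are two views of the same decomposition, which can be verified with a little arithmetic. This sketch types in the Sum Sq column by hand (so it uses the rounded values as printed) and recovers both R-squared figures; the names sum_sq, ss_total, n, and p are illustrative.

```r
# Sum Sq column copied from the anova(model_sale) table above
sum_sq <- c(list = 381050, neighborhood = 2, rooms = 11, full = 168,
            Residuals = 4272)

# Total variation is the sum of all rows, residuals included
ss_total <- sum(sum_sq)

# R-squared: share of total variation not left in the residuals
r_squared <- 1 - sum_sq[["Residuals"]] / ss_total

# Adjusted R-squared with n = 29 homes and p = 4 predictors
n <- 29
p <- 4
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(c(r_squared, adj_r_squared), 4)  # 0.9889 0.9871
```

Both numbers match the summary output, and dividing each entry of sum_sq by ss_total shows that list alone accounts for nearly all of the explained variation.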
Residual Analysis
Now, let’s take a look at the distribution of the residuals. Ideally, residuals should be approximately normally distributed and centered around 0. This supports the assumptions of linear regression. To get a better idea of their distribution, let’s extract and plot them:
residuals_sale <- residuals(model_sale)
hist(residuals_sale,
     breaks = 20,
     col = "lightblue",
     freq = FALSE, # Use density instead of counts
     main = "Histogram of Residuals with Normal Curve",
     xlab = "Residuals")
# Add normal curve
curve(dnorm(x, mean = mean(residuals_sale), sd = sd(residuals_sale)),
      col = "red",
      lwd = 2,
      add = TRUE)
As we can see, the residuals appear more or less normally distributed and centered near zero, which is consistent with the normality assumption of the model.
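A histogram of 29 residuals can be hard to judge by eye, so a formal check such as the Shapiro-Wilk test is a common complement. The sketch below is only a stand-in: to stay runnable without the data file, it fits a simpler model (sale on list alone) to the first five rows as printed by str() earlier, which are rounded values. The names excerpt, fit, res, and sw are illustrative; in the actual analysis one would pass residuals_sale to shapiro.test() instead.

```r
# First five homes as displayed by str(homeprice); values are rounded
excerpt <- data.frame(
  list = c(80, 151, 310, 295, 339),
  sale = c(118, 151, 300, 275, 340)
)

# Simplified stand-in model (assumption: list as the only predictor)
fit <- lm(sale ~ list, data = excerpt)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(res)
sw$p.value
```

With so few points the test has very little power, so this only demonstrates the mechanics; the result becomes meaningful when run on all 29 residuals.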
Explaining List Price
Now, let’s apply the same modeling approach to the listing price instead of the sale price. We’ll use the same predictors: “neighborhood,” “rooms,” and “full” (but we will exclude “list,” since it’s now the outcome):
# Model for list price
model_list <- lm(list ~ neighborhood + rooms + full, data = homeprice)
summary(model_list)
Call:
lm(formula = list ~ neighborhood + rooms + full, data = homeprice)
Residuals:
    Min      1Q  Median      3Q     Max
-94.636 -26.574  -4.456  30.781  71.364
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -166.507     43.270  -3.848 0.000731 ***
neighborhood   83.056     11.298   7.351 1.06e-07 ***
rooms          24.299      6.432   3.778 0.000875 ***
full           14.882     15.133   0.983 0.334832
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 46.29 on 25 degrees of freedom
Multiple R-squared: 0.868, Adjusted R-squared: 0.8522
F-statistic: 54.82 on 3 and 25 DF, p-value: 3.894e-11
In this model, both neighborhood rank and number of rooms are statistically significant predictors (p < 0.001). Neighborhood has the strongest effect. Full bathrooms are not significant. The model explains about 87% of the variation in listing price (R² = 0.868), suggesting a strong overall fit.
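To see what these coefficients mean in practice, here is a small worked prediction. The coefficients are typed in from the summary above; the home itself (neighborhood rank 3, 7 rooms, 2 full baths) is a hypothetical example, not a row from the dataset. On the fitted model, predict(model_list, newdata = ...) would do the same job.

```r
# Coefficients copied from the summary(model_list) output above
coefs <- c("(Intercept)" = -166.507, neighborhood = 83.056,
           rooms = 24.299, full = 14.882)

# A hypothetical home; the leading 1 multiplies the intercept
home <- c("(Intercept)" = 1, neighborhood = 3, rooms = 7, full = 2)

# Predicted list price (in $1000s) is the sum of coefficient * value
predicted_list <- sum(coefs * home[names(coefs)])
predicted_list  # about 282.5, i.e. roughly $282,500
```

Moving the same home up one neighborhood rank would raise the predicted list price by about 83 (in $1000s), which illustrates why neighborhood dominates this model.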
Let’s examine the ANOVA table for this model to see if we can learn more:
anova(model_list)
Analysis of Variance Table
Response: list
             Df Sum Sq Mean Sq  F value    Pr(>F)
neighborhood  1 315177  315177 147.0585 5.732e-12 ***
rooms         1  35233   35233  16.4396 0.0004306 ***
full          1   2073    2073   0.9671 0.3348317
Residuals    25  53580    2143
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see, the variables of neighborhood and rooms seem to have strong impacts on list price. So, we can reasonably guess that a real estate agent might focus more on neighborhood rank when setting the list price, while listing price itself will be the biggest predictor of actual sale price. Fascinating stuff!
Neighborhood effects on Sales vs. Listing Price
Finally, to explore how neighborhood rank influences whether homes sell above or below their asking price, we’ll calculate the difference between sale and list price for each home and compare those differences across neighborhoods.
Let’s start by creating a new variable which represents the difference between “sale” and “list”:
homeprice$diff <- homeprice$sale - homeprice$list
Now, let’s examine this difference with a box plot.
Boxplot of difference by neighborhood
ggplot(homeprice, aes(x = as.factor(neighborhood), y = diff)) +
  geom_boxplot(fill = "lightgreen") +
  labs(
    title = "Sales Price Minus Listing Price by Neighborhood",
    x = "Neighborhood Rank (1 = Poor, 5 = Rich)",
    y = "Sale - List Price (in $1000s)"
  )
Et voilà! Interestingly, from the boxplot, we can see that homes in poorer neighborhoods (rank 1) tend to sell well above their listing price, with a median difference of over 20,000 dollars. This could suggest that sellers are underpricing homes…or that these areas experience higher-than-expected demand.
In contrast, homes in the wealthiest neighborhoods (rank 5) often sell below their listing price, with a median shortfall around 15,000–20,000 dollars. This may reflect overambitious pricing or less competitive pressure in high-end markets. Neighborhoods ranked 2–4 show differences closer to zero, suggesting more balanced or accurate pricing overall.
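The medians read off the boxplot can also be computed directly. As a minimal sketch of the mechanics, the code below uses only the first five homes as printed by str() earlier (rounded values, so the numbers are illustrative and do not reproduce the full-data medians); the names excerpt and med_diff are assumptions. With the full data loaded, the same tapply() call on homeprice$diff and homeprice$neighborhood would give the median per rank.

```r
# First five homes as displayed by str(homeprice); values are rounded
excerpt <- data.frame(
  list = c(80, 151, 310, 295, 339),
  sale = c(118, 151, 300, 275, 340),
  neighborhood = c(1, 1, 3, 3, 4)
)

# Sale minus list, in $1000s: positive means the home sold above asking
excerpt$diff <- excerpt$sale - excerpt$list

# Median difference within each neighborhood rank present in the excerpt
med_diff <- tapply(excerpt$diff, excerpt$neighborhood, median)
med_diff  # rank 1: 19, rank 3: -15, rank 4: 1
```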
Thank you for taking the time to read my final report. Ta ta!