FinalProject_Gordon

Author

Mason Gordon

GEOG 6680 Final Project

In this project we attempt to better understand the relationship between home characteristics and sale price, using a dataset of homes sold in a New Jersey town in 2001.

To begin, we load our libraries and read in the data.

library(ggplot2)
library(dplyr)
library(tidyr)
homes <- read.csv("../data/homeprice.csv")

Question 1

  • Using this file, explore the relationship between the sale price and the other variables using scatterplots, histograms and/or boxplots. Identify those variables that appear to have the strongest relationship with sale price.

Histogram

Let’s begin with a simple histogram showing the distribution of home sale prices in our dataset.

ggplot(homes, aes(x = sale)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 20) +
  theme_minimal() +
  labs(title = "Distribution of Home Sale Prices",
       x = "Sale Price",
       y = "Number of Homes")

Boxplot

Now let’s look at a boxplot of sale price by neighborhood rank.

ggplot(homes, aes(x = as.factor(neighborhood), y = sale)) +
  geom_boxplot(fill = "lightgreen", color = "darkgreen") +
  theme_minimal() +
  labs(title = "Home Sale Prices Across Different Neighborhoods",
       x = "Neighborhood Rank",
       y = "Sale Price")

Scatterplot

Let’s now use scatterplots to see how each of the other variables relates to sale price.

homes_long <- homes %>%
  pivot_longer(
    cols = c(list, full, half, bedrooms, rooms, neighborhood),
    names_to = "Predictor",
    values_to = "Value"
  )

ggplot(homes_long, aes(x = Value, y = sale)) +
  geom_point(alpha = 0.4, color = "darkblue") +  # Alpha adds transparency so overlapping points are visible
  geom_smooth(method = "lm", color = "red", se = FALSE) + # Adds a trendline to each plot
  facet_wrap(~ Predictor, scales = "free_x") +   # This creates the 6-panel grid!
  theme_minimal() +
  labs(
    title = "Sale Price vs. House Characteristics",
    subtitle = "Comparing all different variables against final sale price",
    x = "Value of Predictor",
    y = "Sale Price"
  ) +
  theme(
    strip.text = element_text(face = "bold", size = 12) # Makes the panel titles pop
  )

On visual inspection, every variable appears to have a fairly strong relationship with sale price. List price appears to have the strongest relationship, followed by neighborhood, rooms, and bedrooms. Full and half baths also seem to be related to sale price, but not as strongly. This is based only on a visual inspection; some modeling is required to determine whether these relationships are significant.
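
One way to put a number on a visual impression like this is to compute the correlation between sale price and each predictor. The sketch below uses a small simulated data frame (`toy`, a hypothetical stand-in for `homes`, with made-up values), so the numbers it prints say nothing about the real dataset; only the pattern of the code carries over.

```r
# Hypothetical toy data standing in for `homes`; values are simulated,
# not taken from homeprice.csv.
set.seed(1)
toy <- data.frame(
  rooms = sample(5:10, 30, replace = TRUE),
  full  = sample(1:3, 30, replace = TRUE)
)
toy$sale <- 50 + 30 * toy$rooms + 10 * toy$full + rnorm(30, sd = 20)

# Pearson correlation of every predictor with sale price
predictors <- setdiff(names(toy), "sale")
cors <- sapply(predictors, function(v) cor(toy[[v]], toy$sale))

# Strongest linear relationships first
sort(cors, decreasing = TRUE)
```

With the real data, the same `sapply()` call could be run over the predictor columns of `homes`. Correlation only measures linear association, so it complements the scatterplots rather than replacing them.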

Question 2

  • Now use these variables to build a multiple linear regression model to explain the sale price. Use the summary() function to find the coefficients and goodness-of-fit of the model. Use the anova() function to identify which variable appears to have the greatest effect on sale price. Remember to look at the distribution of residuals.

Let’s start by fitting the linear model.

sale_model <- lm(sale ~ full + half + bedrooms + rooms + neighborhood, data = homes)

Now we can use the summary() function to find the coefficients and goodness-of-fit of the model.

summary(sale_model)

Call:
lm(formula = sale ~ full + half + bedrooms + rooms + neighborhood, 
    data = homes)

Residuals:
   Min     1Q Median     3Q    Max 
-59.31 -34.06   7.20  21.32  55.93 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -135.263     37.283  -3.628  0.00141 ** 
full           26.225     13.896   1.887  0.07181 .  
half           43.242     12.830   3.370  0.00264 ** 
bedrooms       20.409     17.798   1.147  0.26329    
rooms           6.488     10.383   0.625  0.53823    
neighborhood   77.243     10.077   7.665 8.86e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.29 on 23 degrees of freedom
Multiple R-squared:  0.9079,    Adjusted R-squared:  0.8879 
F-statistic: 45.34 on 5 and 23 DF,  p-value: 3.686e-11

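The linear model assigns the largest coefficient to neighborhood: each one-step increase in neighborhood rank is associated with an increase of about 77.2 in sale price, holding the other variables constant. Interestingly, the next two largest coefficients are half baths (43.2) and full baths (26.2), although only neighborhood and half baths are significant at the 0.05 level. Personally, I would have expected rooms and bedrooms to have larger effects, but the model shows that this is not the case; neither of their coefficients is statistically significant.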

The adjusted R-squared is 0.8879, meaning the model explains about 88.8% of the variance in sale price after adjusting for the number of predictors. This indicates a strong fit.
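
As a quick check on how the adjusted value relates to the multiple R-squared, the standard adjustment can be reproduced by hand from numbers printed in the summary() output above. This is only a sketch; the variable names are my own, and the input is the rounded R-squared, so the result matches the printed value only to about four decimal places.

```r
# Values copied from the summary(sale_model) output
r_squared    <- 0.9079  # Multiple R-squared
df_residual  <- 23      # residual degrees of freedom
n_predictors <- 5       # full, half, bedrooms, rooms, neighborhood

# Number of observations implied by the degrees of freedom
n <- df_residual + n_predictors + 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_residual
round(adj_r_squared, 4)
```

The degrees of freedom imply that the model was fitted to 29 homes, which is worth keeping in mind when interpreting everything that follows.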

With that established we can move on to using anova to identify which variables have the greatest effect on sale price.

anova(sale_model)
Analysis of Variance Table

Response: sale
             Df Sum Sq Mean Sq F value    Pr(>F)    
full          1 151632  151632 98.2101 9.062e-10 ***
half          1  87430   87430 56.6271 1.206e-07 ***
bedrooms      1  10581   10581  6.8530   0.01538 *  
rooms         1   9632    9632  6.2387   0.02009 *  
neighborhood  1  90717   90717 58.7562 8.859e-08 ***
Residuals    23  35511    1544                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

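The anova table shows that full bathrooms account for the largest sum of squares, followed by neighborhood and half baths. Note that anova() reports sequential sums of squares, so each variable is credited only with the variation not already explained by the variables listed before it. Full baths are entered first, which helps explain why they rank higher here than their coefficient in summary() would suggest. Again, I am surprised that rooms and bedrooms aren’t higher on the list.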
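
Because anova() on an lm fit uses sequential (Type I) sums of squares, the ranking of variables can change with the order of terms in the formula when predictors are correlated. The sketch below illustrates this with simulated data (the variables `a`, `b`, and `y` are hypothetical, not from the homes dataset), and shows drop1() as one order-independent alternative.

```r
# Simulated data with two correlated predictors (hypothetical, for illustration)
set.seed(42)
n <- 50
a <- rnorm(n)
b <- 0.8 * a + rnorm(n, sd = 0.6)  # b is correlated with a
y <- 2 * a + 2 * b + rnorm(n)
toy <- data.frame(y, a, b)

# Same model, terms entered in a different order
fit_ab <- lm(y ~ a + b, data = toy)
fit_ba <- lm(y ~ b + a, data = toy)
tab_ab <- anova(fit_ab)
tab_ba <- anova(fit_ba)

# Sum of squares credited to `a` when it is entered first vs. last
tab_ab["a", "Sum Sq"]
tab_ba["a", "Sum Sq"]

# drop1() tests each term as if it were entered last, so order does not matter
d1 <- drop1(fit_ab, test = "F")
d1
```

Applied to the homes data, refitting with neighborhood listed first, or calling drop1() on the fitted model, would be a reasonable way to check how much of the ranking above depends on term order.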

With all this said, let’s look at the residuals and see how well the model performed.

sale_residuals <- residuals(sale_model)

hist(sale_residuals, 
     breaks = 15, 
     col = "lightblue", 
     border = "black",
     main = "Histogram of Sale Model Residuals",
     xlab = "Residuals (Actual Price - Predicted Price)")

plot(sale_model, which = 2, main = "Normal Q-Q Plot of Residuals")

plot(sale_model, which = 1, main = "Residuals vs Fitted Values")

The small sample size makes it difficult to learn much from the histogram. However, the Q-Q plot shows that the residuals are close to normally distributed, with a few exceptions at the extremes. The residuals vs. fitted plot reaffirms this, as the model appears to predict too high for the cheapest and most expensive homes.
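
A formal normality test can complement the visual check of the residuals. The sketch below applies the Shapiro-Wilk test to the residuals of a model fitted to simulated data (`toy` and `fit` are hypothetical stand-ins, not the homes model). With a sample this small the test has little power, so a large p-value should not be read as proof of normality.

```r
# Simulated stand-in for a small regression problem (29 observations)
set.seed(7)
toy <- data.frame(x = runif(29, min = 1, max = 5))
toy$y <- 10 + 3 * toy$x + rnorm(29, sd = 2)

fit <- lm(y ~ x, data = toy)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(fit))
sw$p.value
```

For the models in this report, the same call would be `shapiro.test(residuals(sale_model))`.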

Question 3

  • Build a second model using the same variables to explain the list price. Use the anova() function to identify which variable appears to have the greatest effect on list price. Are there differences from the sale price? Could you use this information to recommend which characteristic of a house a real estate agent should concentrate on?

Now we will run the same analysis as before, but with list price in place of sale price.

list_model <- lm(list ~ full + half + bedrooms + rooms + neighborhood, data = homes)

summary(list_model)

Call:
lm(formula = list ~ full + half + bedrooms + rooms + neighborhood, 
    data = homes)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.788 -28.776   4.351  23.859  62.720 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -144.544     36.026  -4.012 0.000546 ***
full           32.125     13.427   2.392 0.025293 *  
half           45.556     12.397   3.675 0.001257 ** 
bedrooms       18.446     17.197   1.073 0.294572    
rooms           7.126     10.033   0.710 0.484661    
neighborhood   77.430      9.737   7.952 4.75e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 37.97 on 23 degrees of freedom
Multiple R-squared:  0.9183,    Adjusted R-squared:  0.9006 
F-statistic: 51.74 on 5 and 23 DF,  p-value: 9.358e-12
anova(list_model)
Analysis of Variance Table

Response: list
             Df Sum Sq Mean Sq  F value    Pr(>F)    
full          1 169594  169594 117.6457 1.615e-10 ***
half          1  92249   92249  63.9922 4.294e-08 ***
bedrooms      1   9745    9745   6.7597   0.01601 *  
rooms         1  10162   10162   7.0494   0.01415 *  
neighborhood  1  91158   91158  63.2352 4.754e-08 ***
Residuals    23  33156    1442                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
list_residuals <- residuals(list_model)

hist(list_residuals, 
     breaks = 15, 
     col = "lightblue", 
     border = "black",
     main = "Histogram of List Model Residuals",
     xlab = "Residuals (Actual Price - Predicted Price)")

plot(list_model, which = 2, main = "Normal Q-Q Plot of Residuals")

plot(list_model, which = 1, main = "Residuals vs Fitted Values")

In comparing the sale and list models, we see that both agree on the importance of neighborhood and bathrooms (half and full). The differences show up when comparing the estimates from the summary() function. Relative to the sale model, the list model puts more weight on bathrooms (full: 32.1 vs. 26.2; half: 45.6 vs. 43.2) and less on bedrooms (18.4 vs. 20.4), so bathrooms seem to be overvalued and bedrooms undervalued when homes are listed. This makes sense to me, since I was surprised earlier by the value of bathrooms. These differences are small compared with the standard errors of the estimates, so they should be treated as suggestive only. I would recommend that a real estate agent concentrate on neighborhood and half baths when setting prices.
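
The two sets of estimates are easier to compare side by side. The sketch below copies the coefficients and the sale-model standard errors from the two summary() outputs above into a data frame (the column names are my own) and expresses each difference in units of the sale-model standard error. This is a rough yardstick, not a formal test, since both models are fitted to the same homes.

```r
# Estimates copied from summary(sale_model) and summary(list_model)
coefs <- data.frame(
  term    = c("full", "half", "bedrooms", "rooms", "neighborhood"),
  sale    = c(26.225, 43.242, 20.409, 6.488, 77.243),
  list    = c(32.125, 45.556, 18.446, 7.126, 77.430),
  sale_se = c(13.896, 12.830, 17.798, 10.383, 10.077)
)

# Positive values: the list model weights the feature more than the sale model
coefs$list_minus_sale <- coefs$list - coefs$sale

# Size of each difference relative to the sale-model standard error
coefs$diff_in_se <- coefs$list_minus_sale / coefs$sale_se

coefs
```

Every difference is smaller than one standard error, with full baths showing the largest gap and bedrooms the only negative one.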

Question 4

  • Finally, what is the effect of neighborhood on the difference between sale price and list price? Do richer neighborhoods mean it is more likely to have a house go over the asking price?

To do this, we need two new variables: the difference between sale price and list price, and a logical (TRUE/FALSE) indicator of whether the sale price was over the list price.

homes <- homes %>%
  mutate(price_diff = sale - list,
         over_asking = price_diff > 0)

Let’s summarize the result by neighborhood.

neighborhood_summary <- homes %>%
  group_by(neighborhood) %>%
  summarize(
    avg_diff = mean(price_diff),
    percent_over_asking = mean(over_asking) * 100
  )
print(neighborhood_summary)
# A tibble: 5 × 3
  neighborhood avg_diff percent_over_asking
         <int>    <dbl>               <dbl>
1            1   18.6                  50  
2            2   -6.15                 12.5
3            3    0.875                41.7
4            4   -2.6                  40  
5            5  -12                    50  

We can also see it visually with a boxplot.

ggplot(homes, aes(x = as.factor(neighborhood), y = price_diff)) +
  geom_boxplot(fill = "lightblue") +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed", linewidth = 1) +  # linewidth replaces the deprecated size argument for lines
  theme_minimal() +
  labs(title = "Sale vs. List Price Difference by Neighborhood",
       x = "Neighborhood Rank (1: Poor -> 5: Rich)",
       y = "Price Difference (Sale - List)")

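Interestingly enough, a richer neighborhood does not mean a home is more likely to go over asking price. The share of homes selling above list price shows no clear trend: it is 50% in both the poorest (rank 1) and richest (rank 5) neighborhoods, and lowest (12.5%) in rank 2. The average difference tells a different story: it falls from +18.6 in rank 1 to -12 in rank 5, so homes in poorer neighborhoods tend to sell further above list price on average. With only 29 homes in the dataset, these patterns should be treated as tentative.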
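
As a rough check on the direction of these patterns, a straight line can be fitted through the five group summaries printed above. The sketch below copies the values from the neighborhood_summary table (so they are rounded) and treats each neighborhood as a single equally weighted point, ignoring how many homes each one contains. It is a crude illustration, not a substitute for modeling the individual homes.

```r
# Values copied from the printed neighborhood_summary table
nbhd <- data.frame(
  neighborhood        = 1:5,
  avg_diff            = c(18.6, -6.15, 0.875, -2.6, -12),
  percent_over_asking = c(50, 12.5, 41.7, 40, 50)
)

# Change in average (sale - list) difference per one-step increase in rank
slope_diff <- coef(lm(avg_diff ~ neighborhood, data = nbhd))[["neighborhood"]]

# Change in percent of homes over asking per one-step increase in rank
slope_pct <- coef(lm(percent_over_asking ~ neighborhood, data = nbhd))[["neighborhood"]]

c(avg_diff = slope_diff, percent_over_asking = slope_pct)
```

The slope for the average difference is negative, while the slope for the percentage over asking is slightly positive and driven mainly by the low value in rank 2. A more careful version could fit `lm(price_diff ~ neighborhood, data = homes)` on the individual homes.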