Introduction

The Brooklyn housing market is an interesting case of rapid price changes, reflecting broader shifts in the real estate market across New York City. As someone who grew up in Brooklyn, I’ve seen firsthand how the value of homes in my neighborhood has changed dramatically over the years. Gentrification, changing demographics, and economic shifts have all contributed to fluctuating house prices.

In this report, I explore two key statistical models—Polynomial Regression and Multiple Linear Regression (MLR)—to understand how various factors influence housing prices. I’ll apply these models to a dataset of homes from New York and discuss how they can be used to predict house prices.


Exploratory Data Analysis (EDA)

Before jumping into model building, it’s essential to understand the dataset. I’ll start with some basic summary statistics, explore the distributions of key variables, and analyze the correlations between variables.

# Load the necessary libraries
library(readr)
library(dplyr)
library(ggplot2)
library(corrplot)

# Load the dataset
url <- 'https://raw.githubusercontent.com/Meccamarshall/Data621/refs/heads/main/Homework2/NY-House-Dataset.csv'
housing_data <- read_csv(url)

# Clean the data: drop rows with a missing price or square footage
housing_data_clean <- housing_data %>%
  filter(!is.na(PRICE) & !is.na(PROPERTYSQFT))

# Summary statistics
summary(housing_data_clean)
##  BROKERTITLE            TYPE               PRICE                BEDS       
##  Length:4801        Length:4801        Min.   :2.494e+03   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.:4.990e+05   1st Qu.: 2.000  
##  Mode  :character   Mode  :character   Median :8.250e+05   Median : 3.000  
##                                        Mean   :2.357e+06   Mean   : 3.357  
##                                        3rd Qu.:1.495e+06   3rd Qu.: 4.000  
##                                        Max.   :2.147e+09   Max.   :50.000  
##       BATH         PROPERTYSQFT     ADDRESS             STATE          
##  Min.   : 0.000   Min.   :  230   Length:4801        Length:4801       
##  1st Qu.: 1.000   1st Qu.: 1200   Class :character   Class :character  
##  Median : 2.000   Median : 2184   Mode  :character   Mode  :character  
##  Mean   : 2.374   Mean   : 2184                                        
##  3rd Qu.: 3.000   3rd Qu.: 2184                                        
##  Max.   :50.000   Max.   :65535                                        
##  MAIN_ADDRESS       ADMINISTRATIVE_AREA_LEVEL_2   LOCALITY        
##  Length:4801        Length:4801                 Length:4801       
##  Class :character   Class :character            Class :character  
##  Mode  :character   Mode  :character            Mode  :character  
##                                                                   
##                                                                   
##                                                                   
##  SUBLOCALITY        STREET_NAME         LONG_NAME         FORMATTED_ADDRESS 
##  Length:4801        Length:4801        Length:4801        Length:4801       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     LATITUDE       LONGITUDE     
##  Min.   :40.50   Min.   :-74.25  
##  1st Qu.:40.64   1st Qu.:-73.99  
##  Median :40.73   Median :-73.95  
##  Mean   :40.71   Mean   :-73.94  
##  3rd Qu.:40.77   3rd Qu.:-73.87  
##  Max.   :40.91   Max.   :-73.70
# Visualizing the distribution of house prices
ggplot(housing_data_clean, aes(x = log(PRICE))) +
  geom_histogram(binwidth = 0.5, fill = "pink", color = "white") +
  labs(title = "Distribution of House Prices (Log-Transformed)", x = "Log(Price)", y = "Count") +
  theme_minimal()
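
Two details in the summary output are worth checking before modeling. PROPERTYSQFT has the same value (2184) for its median, mean, and third quartile, which may mean many rows share a filled-in value, and the maximum PRICE (2.147e+09) is far above the third quartile. Below is a minimal sketch of how such values could be flagged. The toy data frame and the helper modal_share are made up for illustration, and the 3 x IQR cutoff is an arbitrary choice rather than a rule taken from the dataset.

```r
# Hypothetical helper: share of rows equal to the most common value of a column
modal_share <- function(x) {
  counts <- table(x)
  max(counts) / length(x)
}

# Toy data standing in for housing_data_clean (values are invented)
toy <- data.frame(
  PRICE        = c(450000, 825000, 1200000, 2147483647, 699000, 950000),
  PROPERTYSQFT = c(900, 2184, 2184, 2184, 1500, 2184)
)

# A high share suggests a placeholder or imputed value
sqft_modal_share <- modal_share(toy$PROPERTYSQFT)

# Flag prices above Q3 + 3 * IQR (assumed cutoff)
q <- quantile(toy$PRICE, c(0.25, 0.75))
price_cutoff <- q[[2]] + 3 * (q[[2]] - q[[1]])
extreme_price <- toy$PRICE > price_cutoff

sqft_modal_share
toy[extreme_price, ]
```

On the real data, the same two checks could be run on housing_data_clean before deciding whether to keep, cap, or drop the flagged rows.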

# Compute the correlation matrix for price, square footage, and bathrooms
cor_matrix <- cor(housing_data_clean %>% select(PRICE, PROPERTYSQFT, BATH))

# Generate the correlation plot
corrplot(cor_matrix, method = "circle", 
         col = colorRampPalette(c("white", "pink", "deeppink"))(200), 
         tl.col = "red",  # Draw the variable labels in red
         tl.cex = 0.8)

Interpretation of EDA

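The distribution of house prices is heavily right-skewed, which is why the histogram uses a log scale. According to the summary statistics, the middle half of listings fall between roughly $499,000 and $1.5 million, with a median of $825,000, while the mean ($2.36 million) is pulled upward by extreme values such as the $2.147 billion maximum. The correlation matrix covers price, square footage, and bathrooms; bedrooms are not included in it. On the raw price scale the linear association between square footage and price is positive but weak: the polynomial fits later in this report explain only about 2% of the variance in price.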
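
Because the Pearson correlation is sensitive to extreme values, a rank-based (Spearman) correlation can be a useful cross-check when a variable has a long tail. The sketch below uses simulated data, not the housing dataset, and plants one extreme price by hand to show the effect; all variable names are invented for illustration.

```r
set.seed(42)

# Simulated listings with a clear positive relationship
n     <- 200
sqft  <- runif(n, 500, 5000)
price <- 300 * sqft + rnorm(n, sd = 150000)

# Plant one extreme price on the smallest property
price_out <- price
price_out[which.min(sqft)] <- 2e9

pearson_clean   <- cor(sqft, price)
pearson_outlier <- cor(sqft, price_out)
spearman_outlier <- cor(sqft, price_out, method = "spearman")

c(pearson_clean = pearson_clean,
  pearson_outlier = pearson_outlier,
  spearman_outlier = spearman_outlier)
```

In this toy setup a single value is enough to erase the Pearson correlation, while the Spearman correlation stays high. Passing method = "spearman" to cor() on the real columns would show whether the same thing is happening here.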


Concept 1: Polynomial Regression

Definition and Motivation

Polynomial regression is useful for capturing nonlinear relationships. In the context of the housing market, square footage doesn’t always have a straightforward, linear relationship with price. We might observe diminishing returns as square footage increases.

The general equation is:

\[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_n X^n + \epsilon \]
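
One practical note on this equation: R's poly() function, which the fits in this report use, builds orthogonal polynomial terms by default, so its coefficients are not the coefficients of the raw powers of X. The fitted values are the same either way. A small sketch on simulated data (x and y are invented for illustration):

```r
set.seed(1)

# Simulated data with a concave (diminishing returns) relationship
x <- runif(100, 500, 5000)
y <- 200000 + 400 * x - 0.03 * x^2 + rnorm(100, sd = 50000)

# Orthogonal basis (the default) versus raw powers of x
fit_orth <- lm(y ~ poly(x, 2))
fit_raw  <- lm(y ~ poly(x, 2, raw = TRUE))

# Same fitted curve, different coefficient scales
same_fit <- isTRUE(all.equal(fitted(fit_orth), fitted(fit_raw)))
same_fit
coef(fit_orth)
coef(fit_raw)
```

The raw = TRUE coefficients can be read directly as the intercept, linear, and quadratic terms in the equation. The orthogonal version is numerically more stable and keeps lower-order estimates unchanged when a higher degree is added.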

Applying Polynomial Regression

We’ll explore different polynomial degrees (2nd, 3rd, and 4th) to see how well they model the relationship between square footage and price.

# Fit polynomial models of degree 2, 3, and 4 (price as a function of square footage)
poly_model_2 <- lm(PRICE ~ poly(PROPERTYSQFT, 2), data = housing_data_clean)
poly_model_3 <- lm(PRICE ~ poly(PROPERTYSQFT, 3), data = housing_data_clean)
poly_model_4 <- lm(PRICE ~ poly(PROPERTYSQFT, 4), data = housing_data_clean)

# Summary of the models
summary(poly_model_2)
## 
## Call:
## lm(formula = PRICE ~ poly(PROPERTYSQFT, 2), data = housing_data_clean)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
##  -40198049   -2088098    -690444    1584970 2124547650 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2356940     448029   5.261 1.50e-07 ***
## poly(PROPERTYSQFT, 2)1  240889791   31043566   7.760 1.03e-14 ***
## poly(PROPERTYSQFT, 2)2 -193008624   31043566  -6.217 5.49e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31040000 on 4798 degrees of freedom
## Multiple R-squared:  0.02019,    Adjusted R-squared:  0.01978 
## F-statistic: 49.43 on 2 and 4798 DF,  p-value: < 2.2e-16
summary(poly_model_3)
## 
## Call:
## lm(formula = PRICE ~ poly(PROPERTYSQFT, 3), data = housing_data_clean)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
##  -36285179   -2155628    -737991    1700763 2124460752 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2356940     448056   5.260 1.50e-07 ***
## poly(PROPERTYSQFT, 3)1  240889791   31045493   7.759 1.04e-14 ***
## poly(PROPERTYSQFT, 3)2 -193008624   31045493  -6.217 5.50e-10 ***
## poly(PROPERTYSQFT, 3)3   19741253   31045493   0.636    0.525    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31050000 on 4797 degrees of freedom
## Multiple R-squared:  0.02027,    Adjusted R-squared:  0.01966 
## F-statistic: 33.09 on 3 and 4797 DF,  p-value: < 2.2e-16
summary(poly_model_4)
## 
## Call:
## lm(formula = PRICE ~ poly(PROPERTYSQFT, 4), data = housing_data_clean)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
##  -51881728   -1618624    -398736     981415 2120706746 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2356940     447380   5.268 1.44e-07 ***
## poly(PROPERTYSQFT, 4)1  240889791   30998589   7.771 9.46e-15 ***
## poly(PROPERTYSQFT, 4)2 -193008624   30998589  -6.226 5.18e-10 ***
## poly(PROPERTYSQFT, 4)3   19741253   30998589   0.637    0.524    
## poly(PROPERTYSQFT, 4)4  122150360   30998589   3.941 8.25e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.1e+07 on 4796 degrees of freedom
## Multiple R-squared:  0.02343,    Adjusted R-squared:  0.02262 
## F-statistic: 28.77 on 4 and 4796 DF,  p-value: < 2.2e-16
# Visualize the degree-2 and degree-3 polynomial fits over the raw data
ggplot(housing_data_clean, aes(x = PROPERTYSQFT, y = PRICE)) +
  geom_point(color = "pink") +  # Color the points pink
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), col = "hotpink") +  # Polynomial fit degree 2 in pink
  stat_smooth(method = "lm", formula = y ~ poly(x, 3), col = "deeppink") +  # Polynomial fit degree 3 in deeper pink
  labs(title = "Polynomial Regression: Different Degrees of Fit",
       x = "Property Square Footage",
       y = "Price") +
  theme_minimal()

Interpretation of Polynomial Models

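The quadratic model (2nd-degree polynomial) has a negative second-order term, consistent with diminishing returns to square footage, but it explains only about 2% of the variance in price (R-squared = 0.020). Adding a cubic term does not help: its coefficient is not significant (p = 0.525) and the adjusted R-squared falls slightly, from 0.01978 to 0.01966. The quartic term is significant (p = 8.25e-05), yet the adjusted R-squared only reaches 0.0226, and higher-degree polynomials risk fitting noise rather than true patterns in the data. In the plot, the hot pink and deep pink curves show the 2nd- and 3rd-degree fits; the 4th-degree fit is not drawn. The very large maximum residual (about 2.12 billion) suggests that a handful of extreme listings dominate all three fits.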
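
Since the three polynomial models are nested, they can also be compared formally with an F-test (anova) or an information criterion (AIC) rather than by eye. The following is a minimal sketch on simulated data; the data frame toy and its columns are invented for illustration, and the true relationship here is quadratic by construction.

```r
set.seed(7)

# Simulated data: quadratic signal plus noise
x <- runif(300, 500, 5000)
y <- 200000 + 400 * x - 0.03 * x^2 + rnorm(300, sd = 100000)
toy <- data.frame(x = x, y = y)

fit2 <- lm(y ~ poly(x, 2), data = toy)
fit3 <- lm(y ~ poly(x, 3), data = toy)
fit4 <- lm(y ~ poly(x, 4), data = toy)

# Sequential F-tests: does each extra degree reduce the residual sum of squares?
comparison <- anova(fit2, fit3, fit4)
print(comparison)

# Lower AIC indicates a better trade-off between fit and complexity
aic_values <- AIC(fit2, fit3, fit4)
print(aic_values)
```

The same two calls would work on poly_model_2, poly_model_3, and poly_model_4, since all three are fit to the same rows.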


Concept 2: Multiple Linear Regression (MLR)

Definition and Motivation

Multiple linear regression allows us to account for more than one independent variable. In this case, we’ll include bedrooms and bathrooms alongside square footage to predict house prices. The general equation is:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \epsilon \]

Applying MLR with Interaction Terms

In addition to the basic MLR, we’ll also explore interaction terms. An interaction between square footage and bedrooms lets the effect of square footage on price depend on the number of bedrooms, rather than assuming the two act independently.

# Fit the multiple linear regression model with a square footage x bedrooms interaction
mlr_model_interaction <- lm(PRICE ~ PROPERTYSQFT * BEDS + BATH, data = housing_data_clean)

# Summary of the model
summary(mlr_model_interaction)
## 
## Call:
## lm(formula = PRICE ~ PROPERTYSQFT * BEDS + BATH, data = housing_data_clean)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
##  -66603880   -1390611    -607067     566921 2132939346 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.363e+06  9.998e+05  -2.363  0.01814 *  
## PROPERTYSQFT       1.435e+03  2.376e+02   6.041 1.64e-09 ***
## BEDS              -2.325e+05  2.971e+05  -0.783  0.43390    
## BATH               1.164e+06  4.010e+05   2.902  0.00372 ** 
## PROPERTYSQFT:BEDS -4.001e+01  2.427e+01  -1.649  0.09924 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31140000 on 4796 degrees of freedom
## Multiple R-squared:  0.01419,    Adjusted R-squared:  0.01337 
## F-statistic: 17.26 on 4 and 4796 DF,  p-value: 4.555e-14
# Diagnostic plots with pink color
par(mfrow = c(2, 2))  # Arrange plots in a 2x2 layout
plot(mlr_model_interaction, col = "deeppink", pch = 16)  # Set color to pink and point type to solid circle

Checking for Multicollinearity

Multicollinearity can distort the coefficients of the regression model. We’ll use the Variance Inflation Factor (VIF) to check for multicollinearity.

# Install and load car package for VIF
if (!require(car)) install.packages("car")
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(car)

# Checking for multicollinearity using type = "predictor" to handle interaction terms
vif(mlr_model_interaction, type = "predictor")
## GVIFs computed for predictors
##                 GVIF Df GVIF^(1/(2*Df)) Interacts With   Other Predictors
## PROPERTYSQFT 3.01655  3        1.202039           BEDS               BATH
## BEDS         3.01655  3        1.202039   PROPERTYSQFT               BATH
## BATH         3.01655  1        1.736822           --   PROPERTYSQFT, BEDS

Interpretation of MLR Results

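In the MLR model, square footage is the most statistically significant predictor (p = 1.64e-09), and bathrooms are also significant (p = 0.00372). The bedroom main effect is not significant (p = 0.434), and the interaction between bedrooms and square footage is only marginal (p = 0.099). Its negative estimate (about -40) suggests that each additional bedroom slightly lowers the estimated price per square foot, but the evidence for this is weak. Overall the model explains only about 1.4% of the variance in price (R-squared = 0.014), so these coefficients should be read with caution.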
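
With an interaction term, the estimated effect of one extra square foot is no longer a single number: it equals the PROPERTYSQFT coefficient plus the interaction coefficient times the number of bedrooms. The sketch below plugs in the rounded estimates printed in the summary output; the helper slope_at is a made-up name, and the results inherit the uncertainty of those estimates.

```r
# Rounded estimates copied from the model summary
b_sqft        <- 1.435e+03   # PROPERTYSQFT
b_interaction <- -4.001e+01  # PROPERTYSQFT:BEDS

# Hypothetical helper: implied price change per extra square foot
slope_at <- function(beds) b_sqft + b_interaction * beds

slope_at(c(1, 3, 5))
# 1394.99 1314.97 1234.95
```

So the fitted model implies roughly $1,395 per square foot for a one-bedroom home and about $1,235 for a five-bedroom home, holding bathrooms fixed.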

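The adjusted GVIF values (GVIF^(1/(2*Df)), between 1.20 and 1.74) are low, which suggests multicollinearity is not a serious problem for these predictors. Low VIFs speak only to the stability of the coefficient estimates, though; they do not by themselves make the model a good fit.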
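
To make the idea behind VIF concrete: for a model with main effects only, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes this by hand on simulated predictors. It is not the generalized VIF that car::vif reports for models with interactions, and the data frame toy and the helper manual_vif are invented for illustration.

```r
set.seed(3)

# Simulated, deliberately correlated predictors
n    <- 500
beds <- sample(1:6, n, replace = TRUE)
bath <- pmax(1, beds - sample(0:2, n, replace = TRUE))
sqft <- 400 * beds + 250 * bath + rnorm(n, sd = 300)
toy  <- data.frame(beds, bath, sqft)

# Hypothetical helper: VIF of one predictor from its auxiliary regression
manual_vif <- function(data, target) {
  others <- setdiff(names(data), target)
  aux <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(aux)$r.squared)
}

vifs <- sapply(names(toy), manual_vif, data = toy)
vifs
```

A value of 1 means a predictor is uncorrelated with the others; larger values mean its coefficient's variance is inflated by that factor.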


Model Diagnostics

Polynomial Model Diagnostics

We’ll check the assumptions of polynomial regression by looking at the residuals to ensure there are no patterns that suggest the model is missing key information.

# Residual diagnostics for the polynomial model with pink colors
par(mfrow = c(2, 2))  # Arrange plots in a 2x2 layout
plot(poly_model_2, col = "deeppink", pch = 16)  # Set color to pink and point type to solid circle

MLR Model Diagnostics

We’ll also check the residuals of the MLR model to ensure that the assumptions of homoscedasticity and normality are met.

# Residual diagnostics for the MLR model with pink colors
par(mfrow = c(2, 2))  # Arrange plots in a 2x2 layout
plot(mlr_model_interaction, col = "deeppink", pch = 16)  # Set color to pink and point type to solid circle
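
The Residuals vs Leverage panel in these plots is based on Cook's distance, which can also be computed directly to list influential observations. The following is a minimal sketch on simulated data with one planted extreme price; the 4/n cutoff is a common rule of thumb rather than a fixed standard, and all names are invented for illustration.

```r
set.seed(11)

# Simulated listings with one extreme price on the largest property
n     <- 100
sqft  <- runif(n, 500, 5000)
price <- 200000 + 350 * sqft + rnorm(n, sd = 80000)
price[which.max(sqft)] <- 2e9
toy <- data.frame(sqft, price)

fit <- lm(price ~ sqft, data = toy)

# Keep observations whose Cook's distance is at or below 4 / n (assumed cutoff)
cd   <- cooks.distance(fit)
keep <- cd <= 4 / n

# Refit without the flagged rows and compare slopes
fit_trim <- lm(price ~ sqft, data = toy[keep, ])
c(all_rows = coef(fit)[["sqft"]], trimmed = coef(fit_trim)[["sqft"]])
```

Using a logical keep vector avoids accidentally dropping every row when nothing is flagged. Whether flagged listings should be removed, capped, or kept is a judgment call that depends on whether they are data errors or genuine sales.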


Conclusion

Through this analysis, we’ve seen how both polynomial regression and multiple linear regression can be applied to house prices. Polynomial regression is helpful when dealing with nonlinear relationships, such as the diminishing returns of square footage. Multiple linear regression provides a broader view by incorporating several variables that influence house prices, like bedrooms and bathrooms. On raw prices, though, both models explain only about 1 to 2% of the variance, so their predictions should be treated as rough until the extreme listings visible in the residuals are addressed.

For someone like me, who grew up watching the housing market in Brooklyn, these models provide a clearer understanding of how homes are valued. Whether you’re a buyer, seller, or investor, having this knowledge can help you make more informed decisions in a competitive market like New York City.