The Brooklyn housing market is an interesting case of rapid price changes, reflecting broader shifts in the real estate market across New York City. As someone who grew up in Brooklyn, I’ve seen firsthand how the value of homes in my neighborhood has changed dramatically over the years. Gentrification, changing demographics, and economic shifts have all contributed to fluctuating house prices.
In this report, I explore two key statistical models—Polynomial Regression and Multiple Linear Regression (MLR)—to understand how various factors influence housing prices. I’ll apply these models to a dataset of homes from New York and discuss how they can be used to predict house prices.
Before jumping into model building, it’s essential to understand the dataset. I’ll start with some basic summary statistics, explore the distributions of key variables, and analyze the correlations between variables.
# Load the necessary libraries
library(readr)
library(dplyr)
library(ggplot2)
library(corrplot)
# Load the dataset
url <- 'https://raw.githubusercontent.com/Meccamarshall/Data621/refs/heads/main/Homework2/NY-House-Dataset.csv'
housing_data <- read_csv(url)
# Clean the data
housing_data_clean <- housing_data %>%
  filter(!is.na(PRICE) & !is.na(PROPERTYSQFT))
# Summary statistics
summary(housing_data_clean)
## BROKERTITLE TYPE PRICE BEDS
## Length:4801 Length:4801 Min. :2.494e+03 Min. : 1.000
## Class :character Class :character 1st Qu.:4.990e+05 1st Qu.: 2.000
## Mode :character Mode :character Median :8.250e+05 Median : 3.000
## Mean :2.357e+06 Mean : 3.357
## 3rd Qu.:1.495e+06 3rd Qu.: 4.000
## Max. :2.147e+09 Max. :50.000
## BATH PROPERTYSQFT ADDRESS STATE
## Min. : 0.000 Min. : 230 Length:4801 Length:4801
## 1st Qu.: 1.000 1st Qu.: 1200 Class :character Class :character
## Median : 2.000 Median : 2184 Mode :character Mode :character
## Mean : 2.374 Mean : 2184
## 3rd Qu.: 3.000 3rd Qu.: 2184
## Max. :50.000 Max. :65535
## MAIN_ADDRESS ADMINISTRATIVE_AREA_LEVEL_2 LOCALITY
## Length:4801 Length:4801 Length:4801
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## SUBLOCALITY STREET_NAME LONG_NAME FORMATTED_ADDRESS
## Length:4801 Length:4801 Length:4801 Length:4801
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## LATITUDE LONGITUDE
## Min. :40.50 Min. :-74.25
## 1st Qu.:40.64 1st Qu.:-73.99
## Median :40.73 Median :-73.95
## Mean :40.71 Mean :-73.94
## 3rd Qu.:40.77 3rd Qu.:-73.87
## Max. :40.91 Max. :-73.70
# Visualizing the distribution of house prices
ggplot(housing_data_clean, aes(x = log(PRICE))) +
  geom_histogram(binwidth = 0.5, fill = "pink", color = "white") +
  labs(title = "Distribution of House Prices (Log-Transformed)", x = "Log(Price)", y = "Count") +
  theme_minimal()
# Correlation matrix for price, square footage, and bathrooms
cor_matrix <- cor(housing_data_clean %>% select(PRICE, PROPERTYSQFT, BATH))
# Generate the correlation plot
corrplot(cor_matrix, method = "circle",
         col = colorRampPalette(c("white", "pink", "deeppink"))(200),
         tl.col = "red", # Label text color
         tl.cex = 0.8)
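The distribution of house prices is extremely wide and right-skewed, which is why the histogram uses a log scale. According to the summary statistics, the middle half of listings falls between about $499,000 and $1.5 million, with a median of $825,000. The mean ($2.36 million) is pulled upward by a few extreme listings, including a maximum of about $2.15 billion that is likely a data error or placeholder. The correlation matrix covers price, square footage, and bathrooms; bedrooms are not part of the select() call, so they do not appear in it. On the raw price scale the linear association between square footage and price is weak, as the regression output below confirms: a quadratic in square footage explains only about 2% of the variance in price.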
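The cor() call above selects only PRICE, PROPERTYSQFT, and BATH. As a minimal sketch of how BEDS could be included, and of how a single extreme price can hide a relationship on the raw scale, the block below builds a small synthetic data frame. The data, the `toy` name, and the injected outlier are all assumptions for illustration; with the real data you would pass `housing_data_clean` instead.

```r
# Synthetic stand-in for the listings: every value here is made up for illustration.
set.seed(42)
n <- 500
sqft  <- round(runif(n, 400, 5000))
beds  <- pmax(1, round(sqft / 800 + rnorm(n)))
bath  <- pmax(1, round(sqft / 1200 + rnorm(n, sd = 0.7)))
price <- 300 * sqft * exp(rnorm(n, sd = 0.4))   # price roughly proportional to size

toy <- data.frame(PRICE = price, PROPERTYSQFT = sqft, BEDS = beds, BATH = bath)

# Overwrite one row with an implausibly large price, similar in size to the
# maximum reported by summary() above.
toy$PRICE[1] <- 2.147e9
toy$PROPERTYSQFT[1] <- 2184

# Correlations on the raw scale, now including BEDS
cor_raw <- cor(toy)

# Correlations after log-transforming price and square footage
toy_log <- transform(toy, PRICE = log(PRICE), PROPERTYSQFT = log(PROPERTYSQFT))
cor_log <- cor(toy_log)

round(cor_raw["PRICE", ], 3)
round(cor_log["PRICE", ], 3)
```

In this toy example the raw-scale correlation between price and square footage is close to zero because one row dominates the variance of price, while the log-scale correlation recovers the relationship that was built into the data. Whether the same holds for the real listings would need to be checked directly.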
Polynomial regression is useful for capturing nonlinear relationships. In the context of the housing market, square footage doesn’t always have a straightforward, linear relationship with price. We might observe diminishing returns as square footage increases.
The general equation is:
\[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_n X^n + \epsilon \]
We’ll explore different polynomial degrees (2nd, 3rd, and 4th) to see how well they model the relationship between square footage and price.
# Fit polynomial models of degree 2, 3, and 4
poly_model_2 <- lm(PRICE ~ poly(PROPERTYSQFT, 2), data = housing_data_clean)
poly_model_3 <- lm(PRICE ~ poly(PROPERTYSQFT, 3), data = housing_data_clean)
poly_model_4 <- lm(PRICE ~ poly(PROPERTYSQFT, 4), data = housing_data_clean)
# Summary of the models
summary(poly_model_2)
##
## Call:
## lm(formula = PRICE ~ poly(PROPERTYSQFT, 2), data = housing_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40198049 -2088098 -690444 1584970 2124547650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2356940 448029 5.261 1.50e-07 ***
## poly(PROPERTYSQFT, 2)1 240889791 31043566 7.760 1.03e-14 ***
## poly(PROPERTYSQFT, 2)2 -193008624 31043566 -6.217 5.49e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31040000 on 4798 degrees of freedom
## Multiple R-squared: 0.02019, Adjusted R-squared: 0.01978
## F-statistic: 49.43 on 2 and 4798 DF, p-value: < 2.2e-16
summary(poly_model_3)
##
## Call:
## lm(formula = PRICE ~ poly(PROPERTYSQFT, 3), data = housing_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36285179 -2155628 -737991 1700763 2124460752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2356940 448056 5.260 1.50e-07 ***
## poly(PROPERTYSQFT, 3)1 240889791 31045493 7.759 1.04e-14 ***
## poly(PROPERTYSQFT, 3)2 -193008624 31045493 -6.217 5.50e-10 ***
## poly(PROPERTYSQFT, 3)3 19741253 31045493 0.636 0.525
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31050000 on 4797 degrees of freedom
## Multiple R-squared: 0.02027, Adjusted R-squared: 0.01966
## F-statistic: 33.09 on 3 and 4797 DF, p-value: < 2.2e-16
summary(poly_model_4)
##
## Call:
## lm(formula = PRICE ~ poly(PROPERTYSQFT, 4), data = housing_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51881728 -1618624 -398736 981415 2120706746
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2356940 447380 5.268 1.44e-07 ***
## poly(PROPERTYSQFT, 4)1 240889791 30998589 7.771 9.46e-15 ***
## poly(PROPERTYSQFT, 4)2 -193008624 30998589 -6.226 5.18e-10 ***
## poly(PROPERTYSQFT, 4)3 19741253 30998589 0.637 0.524
## poly(PROPERTYSQFT, 4)4 122150360 30998589 3.941 8.25e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.1e+07 on 4796 degrees of freedom
## Multiple R-squared: 0.02343, Adjusted R-squared: 0.02262
## F-statistic: 28.77 on 4 and 4796 DF, p-value: < 2.2e-16
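One detail in the output above is easy to miss: the lower-degree coefficients are identical across all three models. That is because poly() uses an orthogonal polynomial basis by default, so its coefficients are not the \(\beta_1, \beta_2, \dots\) of the raw-power equation. The sketch below, on made-up data (the `x` and `y` vectors are assumptions, not the housing data), compares the default basis with `raw = TRUE`.

```r
# Made-up data with a concave relationship, for illustration only
set.seed(1)
x <- runif(200, 500, 4000)
y <- 50000 + 400 * x - 0.04 * x^2 + rnorm(200, sd = 50000)

fit_orth <- lm(y ~ poly(x, 2))              # orthogonal basis (the default)
fit_raw  <- lm(y ~ poly(x, 2, raw = TRUE))  # raw powers x and x^2
fit_orth3 <- lm(y ~ poly(x, 3))             # one degree higher, orthogonal basis

# Both parameterizations describe the same fitted curve
same_fit <- all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw)))

# With the orthogonal basis, adding a degree leaves lower-order terms unchanged
same_lower_terms <- all.equal(unname(coef(fit_orth)[2:3]),
                              unname(coef(fit_orth3)[2:3]))

coef(fit_orth)  # on the scale of the orthogonal basis
coef(fit_raw)   # interpretable as coefficients on x and x^2
same_fit
same_lower_terms
```

The orthogonal basis is numerically more stable and makes the t-test on each term a test of that degree's added contribution. The raw basis is the one to use when the coefficients themselves need to be read against the polynomial equation.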
# Visualize the degree 2 and degree 3 polynomial fits
ggplot(housing_data_clean, aes(x = PROPERTYSQFT, y = PRICE)) +
  geom_point(color = "pink") + # Color the points pink
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), col = "hotpink") + # Polynomial fit degree 2 in pink
  stat_smooth(method = "lm", formula = y ~ poly(x, 3), col = "deeppink") + # Polynomial fit degree 3 in deeper pink
  labs(title = "Polynomial Regression: Different Degrees of Fit",
       x = "Property Square Footage",
       y = "Price") +
  theme_minimal()
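The quadratic model (2nd-degree polynomial) has a negative second-degree coefficient, which is consistent with diminishing returns to square footage, and both of its terms are statistically significant. The fit itself is weak, however: R-squared is only about 0.02, and the maximum residual of roughly $2.1 billion shows that one extreme listing dominates the error. The cubic term is not significant (p = 0.525) and leaves adjusted R-squared essentially unchanged (0.01966 versus 0.01978). The 4th-degree term is significant (p = 8.25e-05), but adjusted R-squared rises only to 0.0226, so the extra flexibility may be fitting a handful of unusual points rather than a true pattern. The plot overlays the degree 2 (hotpink) and degree 3 (deeppink) fits; the degree 4 fit is not drawn.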
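Choosing a degree by reading three separate summaries is error-prone. One possible way to put nested polynomial fits side by side is sketched below with base R's anova() and AIC(). The data are synthetic and the object names (`toy`, `m1` to `m4`, `comparison`) are assumptions for illustration, not part of the analysis above.

```r
# Synthetic data with a true quadratic relationship, for illustration only
set.seed(7)
n <- 300
x <- runif(n, 500, 4000)
y <- 50000 + 400 * x - 0.04 * x^2 + rnorm(n, sd = 60000)
toy <- data.frame(x = x, y = y)

# Nested polynomial models of increasing degree
m1 <- lm(y ~ poly(x, 1), data = toy)
m2 <- lm(y ~ poly(x, 2), data = toy)
m3 <- lm(y ~ poly(x, 3), data = toy)
m4 <- lm(y ~ poly(x, 4), data = toy)
models <- list(m1, m2, m3, m4)

# Sequential F-tests: does each added degree reduce residual variance?
f_tests <- anova(m1, m2, m3, m4)

# Adjusted R-squared and AIC penalize the extra parameters
comparison <- data.frame(
  degree = 1:4,
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  aic    = sapply(models, AIC)
)

f_tests
comparison
```

Plain R-squared can only go up as terms are added, so adjusted R-squared, AIC, or the F-tests are the more informative columns when deciding whether a higher degree is worth keeping.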
Multiple linear regression allows us to account for more than one independent variable. In this case, we’ll include bedrooms and bathrooms alongside square footage to predict house prices. The general equation is:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \epsilon \]
In addition to the basic MLR, we’ll also explore interaction terms. For example, the interaction between square footage and bedrooms may have a combined effect on price.
# Fit the multiple linear regression model with a PROPERTYSQFT x BEDS interaction
mlr_model_interaction <- lm(PRICE ~ PROPERTYSQFT * BEDS + BATH, data = housing_data_clean)
# Summary of the model
summary(mlr_model_interaction)
##
## Call:
## lm(formula = PRICE ~ PROPERTYSQFT * BEDS + BATH, data = housing_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66603880 -1390611 -607067 566921 2132939346
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.363e+06 9.998e+05 -2.363 0.01814 *
## PROPERTYSQFT 1.435e+03 2.376e+02 6.041 1.64e-09 ***
## BEDS -2.325e+05 2.971e+05 -0.783 0.43390
## BATH 1.164e+06 4.010e+05 2.902 0.00372 **
## PROPERTYSQFT:BEDS -4.001e+01 2.427e+01 -1.649 0.09924 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31140000 on 4796 degrees of freedom
## Multiple R-squared: 0.01419, Adjusted R-squared: 0.01337
## F-statistic: 17.26 on 4 and 4796 DF, p-value: 4.555e-14
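Because the model includes the PROPERTYSQFT * BEDS interaction, the estimated effect of one additional square foot is not a single number; it depends on the number of bedrooms. Using the rounded estimates from the coefficient table above, and keeping in mind that the interaction term is not significant at the 5% level, the implied slope is:

\[\frac{\partial\,\widehat{\text{PRICE}}}{\partial\,\text{PROPERTYSQFT}} = \hat\beta_{\text{PROPERTYSQFT}} + \hat\beta_{\text{PROPERTYSQFT:BEDS}} \cdot \text{BEDS} \approx 1435 - 40.01 \cdot \text{BEDS} \]

For a 3-bedroom home this gives roughly 1435 - 120 = 1315, or about $1,315 per additional square foot, compared with about $1,395 for a 1-bedroom home. The PROPERTYSQFT coefficient on its own is therefore the slope at zero bedrooms, not an average effect.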
# Diagnostic plots with pink color
par(mfrow = c(2, 2)) # Arrange plots in a 2x2 layout
plot(mlr_model_interaction, col = "deeppink", pch = 16) # Set color to pink and point type to solid circle
Multicollinearity can distort the coefficients of the regression model. We’ll use the Variance Inflation Factor (VIF) to check for multicollinearity.
# Install and load car package for VIF
if (!require(car)) install.packages("car")
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(car)
# Checking for multicollinearity using type = "predictor" to handle interaction terms
vif(mlr_model_interaction, type = "predictor")
## GVIFs computed for predictors
## GVIF Df GVIF^(1/(2*Df)) Interacts With Other Predictors
## PROPERTYSQFT 3.01655 3 1.202039 BEDS BATH
## BEDS 3.01655 3 1.202039 PROPERTYSQFT BATH
## BATH 3.01655 1 1.736822 -- PROPERTYSQFT, BEDS
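The MLR model shows that square footage is the most significant predictor of price (about $1,435 per square foot at zero bedrooms, p = 1.64e-09), followed by bathrooms (p = 0.0037). The bedroom main effect is not significant (p = 0.434), and the interaction between bedrooms and square footage is only marginal (p = 0.099). Its negative sign suggests that each extra square foot adds slightly less to price in homes with more bedrooms, but the evidence for this is weak. Overall the model explains about 1.4% of the variance in price (R-squared = 0.014), less than the polynomial models.
The adjusted GVIF values (about 1.2 to 1.7) suggest no serious multicollinearity among the predictors, so the coefficient estimates are not being inflated by overlap between them. Low multicollinearity does not by itself make the model reliable, though; the low R-squared and the very large residuals still need to be addressed.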
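To make the VIF less of a black box, the sketch below computes it by hand in base R: each predictor is regressed on the others, and VIF = 1 / (1 - R²) of that auxiliary regression. This is the plain VIF for a main-effects model, not the predictor-level GVIF that car reports for the interaction model above, and the synthetic `toy` data frame is an assumption for illustration.

```r
# Synthetic, deliberately correlated predictors (made up for illustration)
set.seed(3)
n <- 400
sqft <- runif(n, 500, 4000)
beds <- pmax(1, round(sqft / 800 + rnorm(n)))
bath <- pmax(1, round(beds / 1.5 + rnorm(n, sd = 0.6)))
toy <- data.frame(PROPERTYSQFT = sqft, BEDS = beds, BATH = bath)

# For each predictor, regress it on the remaining predictors and
# convert the auxiliary R-squared into a variance inflation factor.
vif_by_hand <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  aux_fit <- lm(reformulate(others, response = v), data = toy)
  1 / (1 - summary(aux_fit)$r.squared)
})

round(vif_by_hand, 2)
```

A VIF of 1 means a predictor is uncorrelated with the others. Values above roughly 5 or 10 are common rule-of-thumb warning levels, though the cutoff is a convention rather than a formal test.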
We’ll check the assumptions of polynomial regression by looking at the residuals to ensure there are no patterns that suggest the model is missing key information.
# Residual diagnostics for the polynomial model with pink colors
par(mfrow = c(2, 2)) # Arrange plots in a 2x2 layout
plot(poly_model_2, col = "deeppink", pch = 16) # Set color to pink and point type to solid circle
We’ll also check the residuals of the MLR model to ensure that the assumptions of homoscedasticity and normality are met.
# Residual diagnostics for the MLR model with pink colors
par(mfrow = c(2, 2)) # Arrange plots in a 2x2 layout
plot(mlr_model_interaction, col = "deeppink", pch = 16) # Set color to pink and point type to solid circle
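The fourth diagnostic panel (residuals versus leverage) is built around Cook's distance, which can also be inspected directly. The sketch below uses synthetic data with one injected extreme price to show how cooks.distance() could be used to flag influential rows. The 4/n cutoff is a common rule of thumb rather than a formal test, and the data and object names are assumptions for illustration.

```r
# Synthetic data with one implausibly large price (made up for illustration)
set.seed(11)
n <- 200
sqft  <- runif(n, 500, 4000)
price <- 100000 + 300 * sqft + rnorm(n, sd = 80000)
toy <- data.frame(PRICE = price, PROPERTYSQFT = sqft)
toy$PRICE[n] <- 2.147e9        # injected outlier in the last row
toy$PROPERTYSQFT[n] <- 2184

fit <- lm(PRICE ~ PROPERTYSQFT, data = toy)

# Cook's distance measures how much the fit changes when a row is dropped
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / n)        # rule-of-thumb threshold
worst <- unname(which.max(cd))      # single most influential row

# Refit without the most influential row and compare the slopes
fit_trim <- lm(PRICE ~ PROPERTYSQFT, data = toy[-worst, ])
coef(fit)
coef(fit_trim)
```

Dropping rows is only one option. Correcting obvious data errors, or modeling log(PRICE) as in the histogram earlier, are alternatives that keep more of the data.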
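Through this analysis, we’ve seen how both polynomial regression and multiple linear regression can be applied to house prices. Polynomial regression is helpful when dealing with nonlinear relationships, such as the diminishing returns of square footage. Multiple linear regression provides a broader view by incorporating several variables that influence house prices, like bedrooms and bathrooms. On this dataset, however, both approaches explain only about 1% to 2% of the variance in raw price (R-squared between 0.014 and 0.023), and the residuals suggest that a few extreme listings dominate the fit, so the predictions should be treated with caution until those outliers are handled.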
For someone like me, who grew up watching the housing market in Brooklyn, these models provide a clearer understanding of how homes are valued. Whether you’re a buyer, seller, or investor, having this knowledge can help you make more informed decisions in a competitive market like New York City.