The purpose of this project is to understand the relationship between the size of a condominium (measured in gross square feet) and the sales price of the condominium using simple linear regression. Using property sales data from New York city, the project explores the following questions:
How well does the size of a condominium explain or predict sales price across New York City as a whole?
How well does the size of a condominium explain or predict sales price for each individual borough in New York City?
As noted above, the project uses data of properties sold in New York City which is available here. Details captured about properties include neighborhood, building type, square footage, etc. Recent data from 2019 does not contain size information for condominiums (gross_square_feet) so data from January 2018 - December 2018 is used in this project.
Property sales information are recorded in Excel files for each of the five boroughs in New York City: Bronx, Brooklyn, Manhattan, Queens and Staten Island.
Let’s import all the R modules relevant for this project, read in the data files and prepare them for analysis.
library(broom)
library(readr)
library(readxl)
library(dplyr)
library(tidyr)
library(stringr)
library(magrittr)
library(purrr)
library(ggplot2)
#read in property sales files for each borough as a dataframe skipping first 4 header rows which contain metadata
manhattan <- read_excel("manhattan.xlsx", skip = 4)
bronx <- read_excel("bronx.xlsx", skip = 4)
brooklyn <- read_excel("brooklyn.xlsx", skip = 4)
queens <- read_excel("queens.xlsx", skip = 4)
statenisland <- read_excel("statenisland.xlsx", skip = 4)
#combine the dataframes by rows into one dataframe
NYC_property_sales <- bind_rows(manhattan, bronx, brooklyn, queens, statenisland)
cat("Total Number of Property Sales: ", dim(NYC_property_sales)[[1]])
## Total Number of Property Sales: 83782
#remove each borough's dataframe from memory
rm(list = c("manhattan", "bronx", "brooklyn", "queens", "statenisland"))
#preview top 6 rows
head(NYC_property_sales)
Let’s change all column names to lower case, replace spaces with underscores and strip whitespaces around column names if any.
colnames(NYC_property_sales) %<>%
str_trim() %>%
str_replace_all("\\s", "_") %>%
tolower()
head(NYC_property_sales)
Boroughs are identified by numbers 1 through 5 in the data. Let’s replace borough number with the name of the borough and also change the upper case character columns to title case.
NYC_property_sales <- NYC_property_sales %>%
mutate(
borough = case_when(
borough == "1" ~ "Manhattan",
borough == "2" ~ "Bronx",
borough == "3" ~ "Brooklyn",
borough == "4" ~ "Queens",
borough == "5" ~ "Staten Island"
)) %>%
mutate(
neighborhood = str_to_title(neighborhood),
building_class_category = str_to_title(building_class_category),
address = str_to_title(address))
head(NYC_property_sales)
Next, we will perform the following filtering operations on the dataset to get the data needed for our analysis.
Remove observations with gross square footage of zero
Drop missing values in the two columns relevant to our analysis: sale_price and gross_square_feet
Drop ease-ment column which contains no data.
Select distinct observations to avoid data duplicates.
There may be property exchanges between family members which will incur a small transfer fee. We will exclude these transactions assuming a threshold value of $10,000 US dollars.
Because we want to explain condominium sales price, we will focus on a single type of building class(“R4”), i.e., condominiums with elevators. This also helps to compare models across the five New York City boroughs.
NYC_property_sales <- NYC_property_sales %>%
filter (
sale_price > 10000,
gross_square_feet > 0
) %>%
drop_na(gross_square_feet, sale_price) %>%
distinct() %>%
select(-`ease-ment`) %>%
arrange(borough)
NYC_condos_orig <- NYC_property_sales %>%
filter(building_class_at_time_of_sale == "R4")
cat("Total Number of Condo Sales: ", dim(NYC_condos_orig)[[1]])
## Total Number of Condo Sales: 8841
One important assumption of linear regression analysis is linearity. That is there must be a linear relationship between the outcome and predictor variables. Another important assumtpion is that there should be no significant outliers. An outlier, in this case, is an observation for which the value of the outcome variable is far from the value predicted by the linear model.
Let’s use a scatter plot to explore the bivariate relationship between concominium size and sales price to understand the linearity of this relationship and identify any potential outliers.
ggplot(data = NYC_condos_orig,
aes(x = gross_square_feet, y = sale_price)) +
geom_point(aes(color=borough), alpha = 0.4, size=2.5) +
scale_y_continuous(labels = scales::comma, limits = c(0, 75000000)) +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal() +
labs(
title = "Condominium Sale Price in NYC Generally Increases with Size",
x = "Size (Gross Square Feet)",
y = "Sale Price (USD)")
In general, there seems a linear pattern in the data with larger condominiums associated with higher sales prices. Though the data shows no obvious curvature, there is some spread especially as condominium size increases.Qualitatively, we can probably say that there is a moderately strong relationship between gross_square_feet and sale_price.
We can see that sales prices of condominiums are higher in Manhattan compared to other boroughs. It is thus important to also visualize the relationship between gross_square_feet and sale_price for each borough.We will use scales specific to each borough which may help identify potential outliers for investigation.
ggplot(data = NYC_condos_orig,
aes(x = gross_square_feet, y = sale_price, color = borough)) +
geom_point(alpha = 0.3) +
facet_wrap(~ borough, scales = "free", ncol = 2) +
scale_y_continuous(labels = scales::comma) +
geom_smooth(method = "lm", se = FALSE) +
theme(legend.position = "none") +
labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
x = "Size (Gross Square Feet)",
y = "Sale Price (USD)")
Similar to the scatter plot for the combined data for all boroughs, the data shows a linear positive relationship between condominium size and sales price. However, there are some sale_price values that seem potential outliers and should be investigated. For example, in the Queens data, the condominium with the highest sale price does not follow the pattern of the rest of the data. There also seems to be clusters in the data. Let’s investigates these issues next.
Real estate datasets often have data integrity issues where in a multi-units purchase transaction, the total sale price for all combined units is recorded for each individual sales record. For example, a real estate investor may purchase 10 condominiums for a price of $100,000 each, and a grand total of $1 million. However, $1 million is recorded for each condominium purchase transaction, not $100,000. This gross misrepresentation of the true cost of each unit can impact modeling results.
Since multi-unit sales will contain the same price and sale date across many sales, we will build a grouped filter that returns all sales records with three or more observations that have the same sales price and sales date.
multi_unit_sales <- NYC_condos_orig %>%
group_by(sale_price, sale_date) %>%
filter(n() > 2) %>%
arrange(desc(sale_price))
multi_unit_sales
It seems the filter effectively returns multi-sale transactions using the grouping parameters of sale_price and sale_date. Searching many of the addresses listed in the multi_unit_sales dataframe on StreetEasy showed that most of the sales records are part of a multi-unit sales.
We will filter out multi-unit sales from the NYC_condos_orig dataframe. We will also add an ID column to the dataframe so that when we check for outliers, they can be easily identified and removed if necessary.
#Filter out multi-unit sales
NYC_condos_orig <- NYC_condos_orig %>%
group_by(sale_price, sale_date) %>%
filter(n() <= 2) %>%
ungroup()
#Add an index column to uniquely identify each observation.
NYC_condos_orig <- NYC_condos_orig %>%
mutate(ID = 1:nrow(NYC_condos_orig)) %>%
select(ID, everything())
head(NYC_condos_orig)
After filtering out 71 observations constituting multi-unit sales, we are left with a total of 8770 sales data to work with. Next, we will check for outliers.
Earlier, when we used scatter plot to explore bivariate relationship between sale_price and condominium size (i.e., gross_square_feet), we identified data points that may count as outliers. We will investigate outliers in each borough’s data using Cook’s Distance, a metric used in regression analysis to identify influential observations whose inclusion or exclusion impact regression analysis results. These influential outliers, if any, can be identified using Residual vs Leverage plot.
Before we create Residual vs Leverage plot, we must fit a linear model to the sales data. We use the following steps:
group the dataset by borough and create a nested dataframe using the nest() function from the tidyr package. Essentially, the nesting process will isolate the sales records for each borough into separate dataframes. Thus, the 8,770 observations in the NYC_condos_orig dataframe will be collapsed into a dataframe with only 5 rows, where each row contains the data for each borough.
using the map() function, iterate over the 5 rows in the nested dataframe and fit a linear model to each borough’s data with the lm() function.
add the fitted models to the nested dataframe
# nest the NYC_condos_orig dataframe by the borough categorical variable
NYC_condos_orig_nested <- NYC_condos_orig %>%
group_by(borough) %>%
nest()
#fit a linear model to each borough's data and add the list-column 'linear_model' containing the fitted models
NYC_condos_orig_lm <- NYC_condos_orig_nested %>%
mutate(
linear_model = map(.x=data,
.f= ~lm(sale_price ~ gross_square_feet, data = .))
)
NYC_condos_orig_lm
Looking at the data structure of the NYC_condos_orig_lm dataframe, we see the data and linear_model list-columns containing the data and a linear model object for each borough, respectively.
We can pass the fitted model for each borough to the plot() base R function.This creates several regression diagnostic plots. To display only the Residual vs Leverage plot, we will pass in 5 as a second argument to the plot() function.
plot(NYC_condos_orig_lm$linear_model[[1]], 5)
The Residual vs Leverage plot displays the top 3 most influential outliers by default.The outliers are labeled and are often the data points outside of a dashed line, i.e., Cook’s distance, as shown in the above plot.
To display the outlying observations for further assessment, let’s add the fitted values and residuals to the original data using the augment() function and select the 3 observations with maximum Cook’s distance values.
bronx_outliers <- augment(
NYC_condos_orig_lm$linear_model[[1]], NYC_condos_orig_lm$data[[1]]) %>%
slice_max(order_by = .cooksd, n=3) %>%
select(-.fitted : -.cooksd)
bronx_outliers
We can see from the results that the properties with sales prices $1480000 and $857000 have the same size, 2941 gross square feet. The scatter plot for Bronx we saw earlier suggests 3 properties with the same size of 2941 as confirmed below.
The 3 properties are individual units within a larger property as noted on StreetEasy. Since the actual size of each unit is unknown, we will exclude these transactions from further analysis.
bronx_outliers <- NYC_condos_orig_lm$data[[1]] %>%
filter(gross_square_feet == 2941)
bronx_outliers
plot(NYC_condos_orig_lm$linear_model[[2]], 5)
brooklyn_outliers <- augment(
NYC_condos_orig_lm$linear_model[[2]], NYC_condos_orig_lm$data[[2]]) %>%
slice_max(order_by = .cooksd, n=3) %>%
select(-.fitted : -.cooksd)
brooklyn_outliers
The following site shows the three property sales are valid transactions so they will be kept.
plot(NYC_condos_orig_lm$linear_model[[3]], 5)
manhattan_outliers <- augment(
NYC_condos_orig_lm$linear_model[[3]], NYC_condos_orig_lm$data[[3]]) %>%
slice_max(order_by = .cooksd, n=3) %>%
select(-.fitted : -.cooksd)
manhattan_outliers
The observations identified as outliers pertain to building units which are part of a single property and the same gross square feet of 9138 is listed for each unit. These transactions will be excluded from further analyses.
plot(NYC_condos_orig_lm$linear_model[[4]], 5)
queens_outliers <- augment(
NYC_condos_orig_lm$linear_model[[4]], NYC_condos_orig_lm$data[[4]]) %>%
slice_max(order_by = .cooksd, n=3) %>%
select(-.fitted : -.cooksd)
queens_outliers
The first two transactions depart from the general pattern of the remaining data. The first property is the third largest (2781 gross square feet) in the Queens data but is among the cheapest sold ($253,497). On the other hand, the second propert is one of the smallest properties (557 gross square feet) but has the highest sales price ($9,800,000).We will delete these observations and maintain the third transaction.
queens_outliers <- filter(queens_outliers, ID != 8181)
queens_outliers
plot(NYC_condos_orig_lm$linear_model[[5]], 5)
statenisland_outliers <- augment(
NYC_condos_orig_lm$linear_model[[5]], NYC_condos_orig_lm$data[[5]]) %>%
slice_max(order_by = .cooksd, n=3) %>%
select(-.fitted : -.cooksd)
statenisland_outliers
The three transactions will be kept as the StreetEasy site shows they are valid transactions.
Let’s combine into one dataframe all the identified outliers that will be removed from the original dataset, i.e., NYC_condos_orig.
#combine outlier dataframes by rows using the `bind_rows()` function
outliers_df <- bind_rows(
bronx_outliers,
manhattan_outliers,
queens_outliers)
outliers_df
We will now filter out outlier observations from the NYC_condos_orig dataframe using the ID column. We name the resultant dataframe as NYC_condos.
NYC_condos <- NYC_condos_orig %>%
filter(!(ID %in% outliers_df$ID))
cat("Total observations after removing outliers: ", dim(NYC_condos)[[1]])
## Total observations after removing outliers: 8762
Having removed outliers, we are ready to generate a linear regression model of the relationship between sale_price and gross_square_feet. We will first fit a model to the combined data for all New York City boroughs. In a later task, we’ll build many models to compare results across boroughs.
It is worth investigating whether removing outliers have an impact on the regression results. Therefore, we will generate linear models using data from before and after removing outliers. Let’s start with the original data, NYC_condos_orig, which includes all sales records.
NYC_condos_orig_all_lm <- lm(sale_price ~ gross_square_feet, data = NYC_condos_orig)
summary(NYC_condos_orig_all_lm)
##
## Call:
## lm(formula = sale_price ~ gross_square_feet, data = NYC_condos_orig)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13484193 -656027 62384 615047 51847671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.061e+06 3.952e+04 -52.16 <2e-16 ***
## gross_square_feet 3.466e+03 2.867e+01 120.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2074000 on 8768 degrees of freedom
## Multiple R-squared: 0.625, Adjusted R-squared: 0.6249
## F-statistic: 1.461e+04 on 1 and 8768 DF, p-value: < 2.2e-16
Now, let’s see how the above results compare to a model for the NYC_condos data which has some outliers removed.
NYC_condos_all_lm <- lm(sale_price ~ gross_square_feet, data = NYC_condos)
summary(NYC_condos_all_lm)
##
## Call:
## lm(formula = sale_price ~ gross_square_feet, data = NYC_condos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12453256 -646597 34324 558765 51797681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.877e+06 3.731e+04 -50.3 <2e-16 ***
## gross_square_feet 3.294e+03 2.729e+01 120.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1935000 on 8760 degrees of freedom
## Multiple R-squared: 0.6245, Adjusted R-squared: 0.6244
## F-statistic: 1.457e+04 on 1 and 8760 DF, p-value: < 2.2e-16
We performed a simple linear regression of the price of a condominium (sale_price) explained by size of the condominium (gross_square_feet).Two datasets containing condominium sales records for New York City were used: NYC_condos_orig and NYC_condos. The NYC_condos_orig dataset contains all original property sales records whereas some outliers were removed from the NYC_condos dataset. Each model was to test the hypothesis that there is a relationship between sale_price and gross_square_feet.
The beta coefficient for gross_square_feet was slightly higher for the NYC_condos_orig data than the NYC_condos data at 3466 versus 3294. However, both models confirm a relationship between the size and price of a condominium given the high t-statistic and low p-value associated with the beta coefficient in each model. Interestingly, the t-statistic was the same for the two models (120.8 vs 120.7). The p-value was extremely low (well below 0.001) for each model indicating that the relationship between condominium size and price cannot be attributed to a random chance.
The residual standard error (RSE), i.e., the average error of each model, was lower for the model using the NYC_condos dataset ($1,935,000) than the model using the NYC_condos_orig dataset ($2,074,000).However, for each model, the condominium size explained the same amount of variation in sales price; the adjusted R-squared was 0.62 in each case.The regression results indicate that the size of condominium is a good single predictor of sales price.
Overall, there was not much difference in the regression results between the model for the NYC_condos_orig data which conatained all property sales records and the NYC_condos data which was cleaned to remove some outliers. This may be due to the fact that we checked for outliers in the data for each borough and not the combined dataset which may still have outlying observations. Next, We will fit a model for each borough, again using both datasets, to see if removing outliers has an impact on the regression results across the five New York City boroughs.
The rest of this project focuses on analysing the relationship between the size (gross_square_feet) and price (sale_price) of a condominium for each New York City’s borough. We will first work with the original dataset, NYC_condos_orig, and later the NYC_condos data, which has some outliers removed.
Earlier in the project, to check for outliers, we used the nest() function from the tidyr package to nest the NYC_condos_orig dataframe and fitted a linear model to the data for each borough. The resultant nested dataframe, which has the data and a fitted linear model for each borough, is NYC_condos_orig_lm.
Let’s take a look at the NYC_condos_orig_lm.
NYC_condos_orig_lm
We can use the tidy() function from the broom package on the fitted models to return a tidy dataframe of the regresion coefficients for each borough. Let’s store the output of applying the tidy() function in a new list-column tidy-coeff.
# generate a tidy dataframe of reg coefficient estimates
NYC_condos_orig_coeff <- NYC_condos_orig_lm %>%
mutate(
tidy_coeff = map(.x = linear_model,
.f = tidy,
conf.int = TRUE)
)
NYC_condos_orig_coeff
Now, using the unnest() function from the tidyr package, we can unnest the NYC_condos_orig_coeff nested dataframe by the tidy_coeff variable into a single dataframe containing coefficient estimates for each borough in New York City.
For regression analysis, we are largely interested in the coefficient estimates for the slope. In the current context, the slope explains the change in condominium price for each unit change in a condominium’s size (measured in square footage). Therefore, we will filter the unnested dataframe to return only the slope estimates.
# Unnest by the 'tidy_coeff` col to a tidy dataframe of coefficient estimates
NYC_condos_orig_coeff <- NYC_condos_orig_coeff %>%
select(borough, tidy_coeff) %>%
unnest(cols = tidy_coeff)
# Filter to return coefficient estimates for the slope, i.e., gross_square_feet
NYC_condos_orig_slope <- NYC_condos_orig_coeff %>%
filter(term == "gross_square_feet") %>%
arrange(estimate) #arrange results by slope estimate
NYC_condos_orig_slope
The t-statistic and p-value indicate that there is a relationship between sale_price and gross_square_feet for each of the five boroughs. However, the coefficient estimates vary widely across boroughs with the effect of gross_square_feet on sale_price greater for condominiums in Manhattan than in all other boroughs. On average, in Manhattan, an increase in square footage by one unit is estimated to increase the sales price by about $3,724. In contrast, in Staten Island, an increase in square footage by one unit is estimated to increase sales price by about $529.
Now let’s apply the same workflow using broom tools to generate tidy regression summary statistics for each of the five boroughs. The only difference is that instead of using the tidy() function, we will use the glance() function to return model summaries.
# add a list-column `tidy-summary-stats` containing tidy dataframe of regression summary stats for each borough
NYC_summary_stats_orig <- NYC_condos_orig_lm %>%
mutate(tidy_summary_stats = map(.x = linear_model,
.f = glance))
# Unnest to a tidy dataframe
NYC_summary_stats_orig <- NYC_summary_stats_orig %>%
select(borough, tidy_summary_stats) %>%
unnest(cols = tidy_summary_stats) %>%
arrange(r.squared) #arrange by R-squared
NYC_summary_stats_orig
In general, the regression summary satatistics show that the size of a condominium (gross_square_feet) explains much variance in the price of condominiums (sale_price) in some boroughs versus others.For example, the R-squared value was estimated at approximately 0.65 in Manhattan, and 0.64 in Bronx, compared to an estimate of only 0.21 in Queens. One of the smallest condominiums (557 gross_square_feet) has the highest sales price ($9,800,000) in the Queens data. This largely departs from the other observations and may be influencing the low R-squared value for Queens.
Below, we will continue the analysis using the NYC_condos dataset which excludes influential outliers and compare the results with the above regression results based on the original dataset, NYC_condos_orig. Again, We will apply the broom workflow to generate regression coefficient and summary statistics for the five boroughs. To summarise, the general workflow using broom and tidyverse tools to generate many models are as follows:
Nest a dataframe by a categorical variable using the nest() function from the tidyr package - we nest by borough.
Fit models to nested dataframes with the map() function from the purrr package.
Apply the tidy(), augment(), and/or glance() functions from the broom package using each nested model.
Return a tidy dataframe with the unnest() function - this allows us to see the results.
#step 1: nest dataframe by `borough`
NYC_condos_nested <- NYC_condos %>%
group_by(borough) %>%
nest()
#steps 2 & 3: fit models to nested dataframe with the `map()` and apply the `tidy()` function to generate a tidy dataframe of reg coefficient estimates
NYC_condos_lm <- NYC_condos_nested %>%
mutate(
linear_model = map(.x=data,
.f= ~lm(sale_price ~ gross_square_feet, data = .))) %>%
mutate(
tidy_coeff = map(.x = linear_model,
.f = tidy,
conf.int = TRUE))
# Step 4: Unnest to a tidy dataframe of coefficient estimates
NYC_condos_coeff <- NYC_condos_lm %>%
select(borough, tidy_coeff) %>%
unnest(cols = tidy_coeff)
# Filter to return the slope estimate only
NYC_condos_coeff <- NYC_condos_coeff %>%
filter(term == "gross_square_feet") %>%
arrange(estimate)
NYC_condos_coeff
Similar to the regression results using the NYC_condos_orig original data, the results for the NYC_condos data show a relationship between sale_price and gross_square_feet for each of the five boroughs. It must be noted that outliers were removed from the Queens, Bronx and Manhattan data. Removing outliers resulted in an increase in the coefficient estimates for Queens (from 711 to 768) and Bronx (from 678 to 802). In contrast, there was a decrease in the case of Manhattan; the coefficient estimate based on the original dataset was 3724 versus 3514 using the NYC_condos data.
Now, let’s check the regression summary statistics for each borough.
# Generate a tidy dataframe of regression summary statistics
NYC_condos_summary_stats <- NYC_condos_lm %>%
mutate(tidy_summary_stats = map(.x = linear_model,
.f = glance))
NYC_summary_stats_tidy <- NYC_condos_summary_stats %>%
select(borough, tidy_summary_stats) %>%
unnest(cols = tidy_summary_stats) %>%
arrange(r.squared)
NYC_summary_stats_tidy
Overall, the results indicate that removing outliers from the Queens data increased the variance in sale_price explained by gross_square_feet; R-squared value was 0.34 for the NYC_condos data versus 0.21 for the original data, NYC_condos_orig. For both Bronx and Manhattan, there was a marginal decrease of .01 in R-squared value with the NYC_condos data. Also, residual standard error (i.e., sigma) was lower for these three boroughs in the model based on the NYC_condos data versus the NYC_condos_orig data.
This project explored how well the size of a condominium (measured in gross square feet) explains the condominium’s sales price in New York City and in each of the City’s five boroughs: Bronx, Brooklyn, Manhattan, Queens and Staten Island.Toward this end, we generated linear models for New York City as a whole as well as linear models for each individual borough. All the models showed a high t-statistic and low p-value, confirming a relationship between gross_square_feet and sale_price. For the models for each individual borough, there was a wide range in coefficient (slope) estimates with Manhattan having the highest slope estimate.
Similarly, gross_square_feet explained greater variance in sale_price of condominiums in some boroughs versus others.Almost two-thirds of the variance in sale_price is explained by gross_square_fee in the models for Bronx (0.63) and Manhattan(0.64), and more than half of variability in the case of Brooklyn(0.59). In contrast, the R-squared valued was estimated at about 0.48 in Staten Island and 0.34 in Queens.
Overall, our analysis show that the size of a condominium is a better single predictor of the price of condominiums in some New York City’s boroughs versus others. For future analysis, it will be important to explore other relevant predictors of sales price, especially in boroughs such as Queens and Staten Island where gross_square_feet did not explain much variance in sale_price.