Introduction

In this project I’ll explore how well the size of a condominium (measured in gross square feet) explains, or predicts, sale price in New York City. I will also explore how well the size of a condominium predicts sale price in each of the five individual boroughs of New York City: the Bronx, Brooklyn, Manhattan, Staten Island, and Queens.

The property sales data is publicly available and contains sales records from a twelve-month period (November, 2018 through October, 2019).

Understanding the data Loading the libraries needed

library(readxl) # Load Excel files
library(magrittr) # Make all colnames lower case with no spaces
library(stringr) # String formatting and replacement
library(dplyr) # Data wrangling and manipulation
library(readr) # Load and write csv files
library(ggplot2) # Data visualization
library(tidyr) # Nesting and unnesting dataframes

The readr package is loaded so that the csv file can be read into R.

library(readr)
NYC_property_sales <- read_csv('NYC_property_sales.csv')

A first glimpse of the data reveals that there are currently over 38,000 sale records in the dataset.

library(dplyr) # Data wrangling and manipulation
glimpse(NYC_property_sales)
## Rows: 38,177
## Columns: 20
## $ borough                        <chr> "Bronx", "Bronx", "Bronx", "Bronx", "Br~
## $ neighborhood                   <chr> "Bathgate", "Bathgate", "Bathgate", "Ba~
## $ building_class_category        <chr> "01 One Family Dwellings", "01 One Fami~
## $ tax_class_at_present           <chr> "1", "1", "1", "1", "1", "1", "1", "1",~
## $ block                          <dbl> 3030, 3030, 3039, 3043, 3046, 3046, 305~
## $ lot                            <dbl> 62, 70, 63, 55, 35, 39, 62, 152, 119, 1~
## $ building_class_at_present      <chr> "A1", "A1", "A1", "A1", "A1", "A1", "S1~
## $ address                        <chr> "4463 Park Avenue", "4445 Park Avenue",~
## $ apartment_number               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ zip_code                       <dbl> 10457, 10457, 10458, 10457, 10457, 1045~
## $ residential_units              <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, ~
## $ commercial_units               <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ~
## $ total_units                    <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, ~
## $ land_square_feet               <dbl> 1578, 1694, 1650, 2356, 2050, 1986, 785~
## $ gross_square_feet              <dbl> 1470, 1497, 1296, 2047, 1560, 1344, 154~
## $ year_built                     <dbl> 1899, 1899, 1910, 1901, 1899, 1899, 193~
## $ tax_class_at_time_of_sale      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ building_class_at_time_of_sale <chr> "A1", "A1", "A1", "A1", "A1", "A1", "S1~
## $ sale_price                     <dbl> 455000, 388500, 419000, 470000, 445000,~
## $ sale_date                      <dttm> 2018-11-28, 2019-07-23, 2018-12-20, 20~

For this project I will only work with a single type of building class (“R4”), a condominium residential unit in a building with an elevator. This building class is the most common building class in this NYC_property_sales dataframe.

NYC_condos <- NYC_property_sales %>% 
  # Filter to include only property type: CONDO; RESIDENTIAL UNIT IN ELEVATOR BLDG.
  # https://www1.nyc.gov/assets/finance/jump/hlpbldgcode.html
  filter(building_class_at_time_of_sale == "R4")

Exploring Bivariate Relationships with Scatter plots

Now that the data is loaded, processed, and ready to analyze I will use scatter plots to visualize the relationships between condominium sale price and size. The scatter plot below depicts sale price versus size for all five New York City boroughs, combined. In general there is a trend that larger condominiums are associated with a higher sale price. The data follows a somewhat linear pattern. There is no obvious curvature with the shape of the data, but there is a fair amount of spread. The strength of the bi-variate relationship is moderate.

library(ggplot2)
ggplot(data = NYC_condos, 
       aes(x = gross_square_feet, y = sale_price)) +
  geom_point(aes(color = borough), alpha = 0.5) +
  scale_y_continuous(labels = scales::comma, limits = c(0, 75000000)) +
  xlim(0, 10000) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
       x = "Size (Gross Square Feet)",
       y = "Sale Price (USD)")

Zooming in on a smaller subset of the data, it is observed in the same trend below that in general, as the size of a condominium increases, so does the sale price. The pattern is somewhat linear, but there is a fair amount of spread, or dispersion, that becomes more pronounced with an increase in condominium size.

library(ggplot2)
ggplot(data = NYC_condos, 
       aes(x = gross_square_feet, y = sale_price)) +
  geom_point(aes(color = borough), alpha = 0.3) +
  scale_y_continuous(labels = scales::comma, limits = c(0, 20000000)) +
  xlim(0, 5000) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
       x = "Size (Gross Square Feet)",
       y = "Sale Price (USD)")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 153 rows containing non-finite values (stat_smooth).
## Warning: Removed 153 rows containing missing values (geom_point).
## Warning: Removed 8 rows containing missing values (geom_smooth).

To better visualize the spread of data for each borough, the y-axis and x-axis scales that are specific to each borough is used. What neighborhoods have outliers that should be investigated? is the question we look into next.

ggplot(data = NYC_condos, 
       aes(x = gross_square_feet, y = sale_price)) +
  geom_point(alpha = 0.3) +
  facet_wrap(vars(borough), scales = "free", ncol = 2) +
  scale_y_continuous(labels = scales::comma) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
       x = "Size (Gross Square Feet)",
       y = "Sale Price (USD)")
## `geom_smooth()` using formula 'y ~ x'

Looking at the plot above, it is seen that, in general, larger condominiums are associated with a higher sale price in each borough as well. The data follows a somewhat linear pattern in each plot. But the spread is difficult to see with the Manhattan scatter plot, potentially because of the property sale of around $200 million visible to the far-right which may be impacting the visualization. There is no obvious curvature with the shape of the data, for any borough. The strength of the bi-variate relationship is moderate for most boroughs, except for the Queens borough which looks to have a weaker relationship between sale price and size.

Outliers and Data Integrity Issues

The investigation of outliers begins by sorting all sale records by sale price, from high to low.

NYC_condos %>% 
  arrange(desc(sale_price)) %>% 
  head
## # A tibble: 6 x 20
##   borough  neighborhood      building_class_categ~ tax_class_at_pre~ block   lot
##   <chr>    <chr>             <chr>                 <chr>             <dbl> <dbl>
## 1 Manhatt~ Midtown West      13 Condos - Elevator~ 2                  1030  1082
## 2 Manhatt~ Upper East Side ~ 13 Condos - Elevator~ 2                  1401  1003
## 3 Manhatt~ Gramercy          13 Condos - Elevator~ 2                   901  1001
## 4 Manhatt~ Upper East Side ~ 13 Condos - Elevator~ 2                  1375  1441
## 5 Manhatt~ Upper East Side ~ 13 Condos - Elevator~ 2                  1375  1438
## 6 Manhatt~ Midtown West      13 Condos - Elevator~ 2                  1030  1117
## # ... with 14 more variables: building_class_at_present <chr>, address <chr>,
## #   apartment_number <chr>, zip_code <dbl>, residential_units <dbl>,
## #   commercial_units <dbl>, total_units <dbl>, land_square_feet <dbl>,
## #   gross_square_feet <dbl>, year_built <dbl>, tax_class_at_time_of_sale <dbl>,
## #   building_class_at_time_of_sale <chr>, sale_price <dbl>, sale_date <dttm>

Research of the highest price listing in the dataset reveals that this property sale was actually the most expensive home ever sold in the United States at the time of the sale. The luxurious building that this particular unit is located in even has its own Wikipedia page.

The real estate transaction with the second-highest sale price in this dataset was also news worthy.

These two most expensive property sales also happen to be the two largest in terms of gross square footage. I will remove the second-highest listing at 165 East 66th Street because this transaction looks to be for an entire block of residences. I would like to limit this analysis to transactions of single units, if possible.

# Make copy of dataframe before removing any sale records
NYC_condos_original <- NYC_condos
# Remove 165 East 66th Street sale record
NYC_condos <- NYC_condos %>% 
  filter(address != "165 East 66th St, Resi")

I will leave the record-setting home sale observation in the dataset for now because the sale price is confirmed to be legitimate.

How well does gross square feet explain sale price for all records combined?

Next I’ll take a look at the highest sale price observations in Brooklyn. There are a number of sale records at a sale price of around $30 Million, but there is only a single observation in the range of $10-$30 Million. Could this be correct?

NYC_condos %>% 
  filter(borough == "Brooklyn") %>% 
  arrange(desc(sale_price))
## # A tibble: 1,751 x 20
##    borough  neighborhood building_class_category  tax_class_at_pres~ block   lot
##    <chr>    <chr>        <chr>                    <chr>              <dbl> <dbl>
##  1 Brooklyn Gowanus      13 Condos - Elevator Ap~ 2                   1046  1001
##  2 Brooklyn Gowanus      13 Condos - Elevator Ap~ 2                   1046  1002
##  3 Brooklyn Gowanus      13 Condos - Elevator Ap~ 2                   1046  1003
##  4 Brooklyn Gowanus      13 Condos - Elevator Ap~ 2                   1046  1004
##  5 Brooklyn Gowanus      13 Condos - Elevator Ap~ 2                   1046  1005
##  6 Brooklyn Gowanus      13 Condos - Elevator Ap~ 2                   1046  1006
##  7 Brooklyn Gowanus      13 Condos - Elevator Ap~ 2                   1046  1007
##  8 Brooklyn Gowanus      13 Condos - Elevator Ap~ 2                   1046  1008
##  9 Brooklyn Gowanus      13 Condos - Elevator Ap~ 2                   1046  1009
## 10 Brooklyn Gowanus      13 Condos - Elevator Ap~ 2                   1046  1010
## # ... with 1,741 more rows, and 14 more variables:
## #   building_class_at_present <chr>, address <chr>, apartment_number <chr>,
## #   zip_code <dbl>, residential_units <dbl>, commercial_units <dbl>,
## #   total_units <dbl>, land_square_feet <dbl>, gross_square_feet <dbl>,
## #   year_built <dbl>, tax_class_at_time_of_sale <dbl>,
## #   building_class_at_time_of_sale <chr>, sale_price <dbl>, sale_date <dttm>

Looking through the results I see that there are approximately 40 sales records with a price of $29,620,207. This price point appears to be unusual for Brooklyn. Scrolling through the results using the viewer in R Studio I also see that all 40 property sales took place on the same day, 2019-04-08. This indicates that a transaction took place on this date where all 40 units were purchased for a TOTAL price of $29,620,207, not $29,620,207 per unit.

Thanks to the internet it does not take long to find information about this new building. Sure enough, this building contains 40 total units. But according to the website, the average price per unit for the 26 “active sales” is around $990,000 and the average price for the 14 previous sales is around $816,000, per unit.

For the purposes of the project, I will remove all 40 observations from the dataset because sale prices for each unit are erroneous. I could consider other ways of correcting the data. One option is to determine the price-per-square-foot by dividing the $29M sale price by the total number of square feet sold across all 40 units, and then using this number to assign a price to each unit based on its size. But that is not worth the time and I can’t be certain that method would yield valid results.

Fortunately, there is a programmatic option for surfacing potential multi-unit sales where each sale record contains the sale price for the entire real estate deal, not the price for the individual unit. Below I build a grouped filter that returns all sale records with three or more observations that have the same sale price and sale date. In general, multi-unit sales contain the same price and sale date across many sale records. When building a grouped filter, there is a need to be careful not to “over-filter” by making the criteria too specific. In this case it looks like the filter effectively surfaces multi-sale transactions using only two grouping parameters: sale_price and sale_date.

multi_unit_sales <- NYC_condos %>% 
  group_by(sale_price, sale_date) %>% 
  filter(n() >= 3) %>% 
  arrange(desc(sale_price))
head(multi_unit_sales)
## # A tibble: 6 x 20
## # Groups:   sale_price, sale_date [1]
##   borough   neighborhood building_class_category  tax_class_at_pres~ block   lot
##   <chr>     <chr>        <chr>                    <chr>              <dbl> <dbl>
## 1 Manhattan Tribeca      13 Condos - Elevator Ap~ 2                    223  1105
## 2 Manhattan Tribeca      13 Condos - Elevator Ap~ 2                    223  1123
## 3 Manhattan Tribeca      13 Condos - Elevator Ap~ 2                    223  1124
## 4 Manhattan Tribeca      13 Condos - Elevator Ap~ 2                    223  1125
## 5 Manhattan Tribeca      13 Condos - Elevator Ap~ 2                    223  1126
## 6 Manhattan Tribeca      13 Condos - Elevator Ap~ 2                    223  1127
## # ... with 14 more variables: building_class_at_present <chr>, address <chr>,
## #   apartment_number <chr>, zip_code <dbl>, residential_units <dbl>,
## #   commercial_units <dbl>, total_units <dbl>, land_square_feet <dbl>,
## #   gross_square_feet <dbl>, year_built <dbl>, tax_class_at_time_of_sale <dbl>,
## #   building_class_at_time_of_sale <chr>, sale_price <dbl>, sale_date <dttm>

There are many ways to remove the multi-unit sales from the NYC_condos dataframe. Below are two identical methods: (1) filter for only the sale records we wish to retain that have two or less instances of sale_price and sale_date, or (2) use an anti-join to drop all records from NYC_condos found in multi_unit_sales.

NYC_condos <- NYC_condos %>%
  group_by(sale_price, sale_date) %>%
  filter(n() <= 2) %>%
  ungroup()
# Alternative method
NYC_condos <- NYC_condos %>% 
  anti_join(multi_unit_sales)

Linear Regression Model for Boroughs in New York City Combined

Now that I’ve removed many multi-unit sales from the dataset, time to generate a linear regression model for all New York City neighborhoods combined.

NYC_condos_lm <- lm(sale_price ~ gross_square_feet, data = NYC_condos)  
summary(NYC_condos_lm)
## 
## Call:
## lm(formula = sale_price ~ gross_square_feet, data = NYC_condos)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -19865368   -840026    156635    938585 140321290 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -3.110e+06  5.709e+04  -54.47   <2e-16 ***
## gross_square_feet  4.462e+03  3.947e+01  113.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2945000 on 7944 degrees of freedom
## Multiple R-squared:  0.6166, Adjusted R-squared:  0.6166 
## F-statistic: 1.278e+04 on 1 and 7944 DF,  p-value: < 2.2e-16

How does this compare to the NYC_condos_original dataframe that includes multi-unit sales?

NYC_condos_original_lm <- lm(sale_price ~ gross_square_feet, data = NYC_condos_original)  
summary(NYC_condos_original_lm)
## 
## Call:
## lm(formula = sale_price ~ gross_square_feet, data = NYC_condos_original)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -79533546  -1211195   -860173   -251714 211550415 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       940683.39   57703.86   16.30   <2e-16 ***
## gross_square_feet   1192.72      19.43   61.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4745000 on 8094 degrees of freedom
## Multiple R-squared:  0.3177, Adjusted R-squared:  0.3176 
## F-statistic:  3769 on 1 and 8094 DF,  p-value: < 2.2e-16

Comparison of linear modeling results**

A bi-variate linear regression of sale_price (price) explained by gross_square_feet (size) was performed on two different datasets containing condominium sale records for New York City. One dataset, NYC_condos, was cleaned to remove multi-unit sale records (where the same sale price is recorded for many units). The other dataset, NYC_condos_original, remained unaltered and contained all original sale records. In each case, the hypothesis is that there is a relationship between the size of a condominium (gross_square_feet) and the price (sale_price). It can be declared that there is a relationship between condominium size and price when the slope is sufficiently far from zero.

For each model, the t-statistic was high enough, and the p-value was low enough, to declare that there is, in fact, a relationship between gross_square_feet and sale_price. The t-statistic for the cleaned dataset (NYC_condos) was nearly double that of the original dataset (NYC_condos_original) at 113.04 versus 61.39. In each case the p-value was well below the 0.05 cutoff for significance meaning that it is extremely unlikely that the relationship between condominium size and sale price is due to random chance.

The confidence interval for the slope is [4384.254, 4538.999] for the NYC_condos dataset compared to only [1154.636, 1230.802] for the NYC_condos_original dataset. This difference can likely be attributed to the removal of many multi-million dollar sale records for smaller units which impacted price predictions in the original dataset. The measure for lack of fit, or residual standard error (RSE) was lower for the cleaned dataset at 2,945,000 compared to 4,745,000 for the original dataset. However, it must be noted that the NYC_condos is smaller than the NYC_condos_original by 150 observations. Finally, the R-squared, or the proportion of the variability in sale_price that can be explained by gross_square_feet is 0.6166 for the cleaned NYC_condos. This is nearly double the R-squared value estimated for the NYC_condos_original dataset at 0.3177.

Below is the updated scatterplot that uses the cleaned NYC_condos data. For the Brooklyn borough we are better able to see the spread of the data and how the trend line fits the data because we removed the $30 million outliers. The same is true for the Manhattan borough because the $200 million multi-unit sale was removed.

ggplot(data = NYC_condos, 
       aes(x = gross_square_feet, y = sale_price)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~ borough, scales = "free", ncol = 2) +
  scale_y_continuous(labels = scales::comma) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(title = "Condominium Sale Price in NYC Generally Increases with Size",
       x = "Size (Gross Square Feet)",
       y = "Sale Price (USD)")

Linear Regression Models for each Borough - Coefficient Estimates

Now, apply the broom workflow to compare coefficient estimates across the five boroughs. The general workflow using broom and tidyverse tools to generate many models involves 4 steps:

  1. Nest a dataframe by a categorical variable with the nest() function from the tidyr package - I will nest by borough.
  2. Fit models to nested dataframes with the map() function from the purrr package.
  3. Apply the broom functions tidy(), augment(), and/or glance() using each nested model
  4. Return a tidy dataframe with the unnest() function - this allows us to see the results.
# Step 1: nest by the borough categorical variable
library(broom)
library(tidyr)
library(purrr)
NYC_nested <- NYC_condos %>% 
  group_by(borough) %>% 
  nest()

In the previous step, the NYC_condos dataframe was collapsed from 7,946 observations to only 5. The nesting process isolated the sale records for each borough into separate dataframes.

# Inspect the format
print(NYC_nested)
## # A tibble: 5 x 2
## # Groups:   borough [5]
##   borough       data                 
##   <chr>         <list>               
## 1 Bronx         <tibble [330 x 19]>  
## 2 Brooklyn      <tibble [1,697 x 19]>
## 3 Manhattan     <tibble [4,819 x 19]>
## 4 Queens        <tibble [1,004 x 19]>
## 5 Staten Island <tibble [96 x 19]>

Values of any nested dataframe can be extracted and inspected. Below is a look at the first six rows for Manhattan.

# View first few rows for Manhattan
head(NYC_nested$data[[3]])
## # A tibble: 6 x 19
##   neighborhood  building_class_ca~ tax_class_at_pr~ block   lot building_class_~
##   <chr>         <chr>              <chr>            <dbl> <dbl> <chr>           
## 1 Alphabet City 13 Condos - Eleva~ 2                  373  1008 R4              
## 2 Alphabet City 13 Condos - Eleva~ 2                  373  1009 R4              
## 3 Alphabet City 13 Condos - Eleva~ 2                  375  1017 R4              
## 4 Alphabet City 13 Condos - Eleva~ 2                  378  1005 R4              
## 5 Alphabet City 13 Condos - Eleva~ 2                  384  1205 R4              
## 6 Alphabet City 13 Condos - Eleva~ 2                  384  1210 R4              
## # ... with 13 more variables: address <chr>, apartment_number <chr>,
## #   zip_code <dbl>, residential_units <dbl>, commercial_units <dbl>,
## #   total_units <dbl>, land_square_feet <dbl>, gross_square_feet <dbl>,
## #   year_built <dbl>, tax_class_at_time_of_sale <dbl>,
## #   building_class_at_time_of_sale <chr>, sale_price <dbl>, sale_date <dttm>

The next step in the process is to fit a linear model to each individual dataframe. What this means is that separate linear models for each borough will be generated individually.

# Step 2: fit linear models to each borough, individually
NYC_nested <- NYC_nested %>% 
  mutate(linear_model = map(.x = data, 
                            .f = ~lm(sale_price ~ gross_square_feet, 
                                     data = .)))

Taking a look at the data structure we see that we have a new list-column called linear_model that contains a linear model object for each borough.

# Inspect the data structure
print(NYC_nested)
## # A tibble: 5 x 3
## # Groups:   borough [5]
##   borough       data                  linear_model
##   <chr>         <list>                <list>      
## 1 Bronx         <tibble [330 x 19]>   <lm>        
## 2 Brooklyn      <tibble [1,697 x 19]> <lm>        
## 3 Manhattan     <tibble [4,819 x 19]> <lm>        
## 4 Queens        <tibble [1,004 x 19]> <lm>        
## 5 Staten Island <tibble [96 x 19]>    <lm>

We can view the linear modeling results for any one of the nested objects using the summary() function. Below are the linear regression statistics for Manhattan.

# Verify model results for Manhattan
summary(NYC_nested$linear_model[[3]])
## 
## Call:
## lm(formula = sale_price ~ gross_square_feet, data = .)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -21128191  -1001860    224235   1103543 134333578 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -3.254e+06  8.584e+04  -37.91   <2e-16 ***
## gross_square_feet  4.728e+03  5.178e+01   91.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3588000 on 4817 degrees of freedom
## Multiple R-squared:  0.6338, Adjusted R-squared:  0.6337 
## F-statistic:  8337 on 1 and 4817 DF,  p-value: < 2.2e-16

A quick look at the R-squared value for the Manhattan linear model indicates that gross_square_feet looks to be a fairly good single predictor of sale_price. Almost two-thirds of the variability with sale_price is explained by gross_square_feet.

The next step is to transform these linear model summary statistics into a tidy format.

# Step 3: generate a tidy dataframe of coefficient estimates that includes confidence intervals
NYC_nested <- NYC_nested %>% 
  mutate(tidy_coefficients = map(.x = linear_model, 
                                 .f = tidy, 
                                 conf.int = TRUE))
NYC_condos
## # A tibble: 7,946 x 20
##    borough neighborhood      building_class_categ~ tax_class_at_pre~ block   lot
##    <chr>   <chr>             <chr>                 <chr>             <dbl> <dbl>
##  1 Bronx   Bronxdale         13 Condos - Elevator~ 2                  4340  1025
##  2 Bronx   Bronxdale         13 Condos - Elevator~ 2                  4340  1039
##  3 Bronx   Bronxdale         13 Condos - Elevator~ 2                  4340  1078
##  4 Bronx   Bronxdale         13 Condos - Elevator~ 2                  4340  1107
##  5 Bronx   Bronxdale         13 Condos - Elevator~ 2                  4340  1168
##  6 Bronx   Bronxdale         13 Condos - Elevator~ 2                  4340  1225
##  7 Bronx   Bronxdale         13 Condos - Elevator~ 2                  4340  1245
##  8 Bronx   Bronxdale         13 Condos - Elevator~ 2                  4340  1246
##  9 Bronx   East Tremont      13 Condos - Elevator~ 2                  3125  1006
## 10 Bronx   Kingsbridge/Jero~ 13 Condos - Elevator~ 2                  5764  1017
## # ... with 7,936 more rows, and 14 more variables:
## #   building_class_at_present <chr>, address <chr>, apartment_number <chr>,
## #   zip_code <dbl>, residential_units <dbl>, commercial_units <dbl>,
## #   total_units <dbl>, land_square_feet <dbl>, gross_square_feet <dbl>,
## #   year_built <dbl>, tax_class_at_time_of_sale <dbl>,
## #   building_class_at_time_of_sale <chr>, sale_price <dbl>, sale_date <dttm>

Now there is a new variable called tidy_coefficients that contains tidy coefficient estimates for each of the five boroughs. These tidy statistics are currently stored in five separate dataframes. Below are the coefficient estimates for Manhattan.

# Inspect the results for Manhattan
print(NYC_nested$tidy_coefficients[[3]])
## # A tibble: 2 x 7
##   term               estimate std.error statistic   p.value  conf.low conf.high
##   <chr>                 <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)       -3253992.   85837.      -37.9 1.94e-275 -3422273. -3085712.
## 2 gross_square_feet     4728.      51.8      91.3 0             4626.     4829.

Now we can unnest the tidy_coefficients variable into a single dataframe that includes coefficient estimates for each of New York City’s five boroughs.

# Step 4: Unnest to a tidy dataframe of coefficient estimates
NYC_coefficients_tidy <- NYC_nested %>% 
  select(borough, tidy_coefficients) %>% 
  unnest(cols = tidy_coefficients)
print(NYC_coefficients_tidy)
## # A tibble: 10 x 8
## # Groups:   borough [5]
##    borough   term      estimate std.error statistic   p.value conf.low conf.high
##    <chr>     <chr>        <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 Bronx     (Interce~  -2.54e5   24668.     -10.3  9.89e- 22  -3.03e5  -205485.
##  2 Bronx     gross_sq~   6.49e2      29.6     21.9  2.58e- 66   5.91e2      707.
##  3 Brooklyn  (Interce~  -2.72e5   34487.      -7.88 5.64e- 15  -3.40e5  -204248.
##  4 Brooklyn  gross_sq~   1.29e3      29.7     43.2  1.06e-275   1.23e3     1344.
##  5 Manhattan (Interce~  -3.25e6   85837.     -37.9  1.94e-275  -3.42e6 -3085712.
##  6 Manhattan gross_sq~   4.73e3      51.8     91.3  0           4.63e3     4829.
##  7 Queens    (Interce~   3.99e4   28311.       1.41 1.59e-  1  -1.56e4    95481.
##  8 Queens    gross_sq~   7.33e2      32.0     22.9  8.14e- 94   6.70e2      795.
##  9 Staten I~ (Interce~   4.64e4   28256.       1.64 1.04e-  1  -9.67e3   102536.
## 10 Staten I~ gross_sq~   2.89e2      30.7      9.41 3.30e- 15   2.28e2      349.

The main interest is in the slope which explains the change in y (sale price) for each unit change in x (square footage). We can filter for the slope estimate only as follows.

# Filter to return the slope estimate only 
NYC_slope <- NYC_coefficients_tidy %>%   
  filter(term == "gross_square_feet") %>% 
  arrange(estimate)
print(NYC_slope)
## # A tibble: 5 x 8
## # Groups:   borough [5]
##   borough    term      estimate std.error statistic   p.value conf.low conf.high
##   <chr>      <chr>        <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 Staten Is~ gross_sq~     289.      30.7      9.41 3.30e- 15     228.      349.
## 2 Bronx      gross_sq~     649.      29.6     21.9  2.58e- 66     591.      707.
## 3 Queens     gross_sq~     733.      32.0     22.9  8.14e- 94     670.      795.
## 4 Brooklyn   gross_sq~    1285.      29.7     43.2  1.06e-275    1227.     1344.
## 5 Manhattan  gross_sq~    4728.      51.8     91.3  0            4626.     4829.

The results are arranged in ascending order by the slope estimate. For each of the five boroughs, the t-statistic and p-value indicate that there is a relationship between sale_price and gross_square_feet. In Staten Island, an increase in square footage by one unit is estimated to increase the sale price by about $288, on average. In contrast, an increase in total square footage by one unit is estimated to result in an increase in sale price of about $4,728, on average.

Linear Regression Models for each Borough - Regression Summary Statistics

Now I will apply the same workflow using broom tools to generate tidy regression summary statistics for each of the five boroughs. Below I follow the same process as seen previously with the tidy() function, but instead glance() function is used.

# Generate a tidy dataframe of regression summary statistics
NYC_summary_stats <- NYC_condos %>% 
  group_by(borough) %>% 
  nest() %>% 
  mutate(linear_model = map(.x = data, 
                            .f = ~lm(sale_price ~ gross_square_feet, 
                                     data = .))) %>%
  mutate(tidy_summary_stats = map(.x = linear_model,
                                  .f = glance))
print(NYC_summary_stats)
## # A tibble: 5 x 4
## # Groups:   borough [5]
##   borough       data                  linear_model tidy_summary_stats
##   <chr>         <list>                <list>       <list>            
## 1 Bronx         <tibble [330 x 19]>   <lm>         <tibble [1 x 12]> 
## 2 Brooklyn      <tibble [1,697 x 19]> <lm>         <tibble [1 x 12]> 
## 3 Manhattan     <tibble [4,819 x 19]> <lm>         <tibble [1 x 12]> 
## 4 Queens        <tibble [1,004 x 19]> <lm>         <tibble [1 x 12]> 
## 5 Staten Island <tibble [96 x 19]>    <lm>         <tibble [1 x 12]>

Now we have a new variable called tidy_summary_stats that contains tidy regression summary statistics for each of the five boroughs in New York City. These tidy statistics are currently stored in five separate dataframes. Below we unnest the five dataframes to a single, tidy dataframe arranged by R-squared value.

# Unnest to a tidy dataframe of
NYC_summary_stats_tidy <- NYC_summary_stats %>% 
  select(borough, tidy_summary_stats) %>% 
  unnest(cols = tidy_summary_stats) %>% 
  arrange(r.squared)
print(NYC_summary_stats_tidy)
## # A tibble: 5 x 13
## # Groups:   borough [5]
##   borough      r.squared adj.r.squared   sigma statistic   p.value    df  logLik
##   <chr>            <dbl>         <dbl>   <dbl>     <dbl>     <dbl> <dbl>   <dbl>
## 1 Queens           0.344         0.343  2.80e5     525.  8.14e- 94     1 -14018.
## 2 Staten Isla~     0.485         0.480  7.35e4      88.6 3.30e- 15     1  -1211.
## 3 Brooklyn         0.524         0.524  5.65e5    1868.  1.06e-275     1 -24883.
## 4 Bronx            0.595         0.594  1.51e5     481.  2.58e- 66     1  -4404.
## 5 Manhattan        0.634         0.634  3.59e6    8337.  0             1 -79570.
## # ... with 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>,
## #   df.residual <int>, nobs <int>

These results will be summarized in the conclusion paragraph below.

Conclusion

The analysis showed that, in general, the gross_square_feet variable is useful for explaining, or estimating, sale_price for condominiums in New York City. We observed that removing multi-unit sales from the dataset increased model accuracy. With linear models generated for New York City as a whole, and with linear models generated for each borough individually, we observed in all cases that the t-statistic was high enough, and the p-value was low enough, to declare that there is a relationship between gross_square_feet and sale_price.

For the linear models that was generated for each individual borough, a wide range in slope estimates was observed. The slope estimate for Manhattan was much higher than the estimate for any of the other boroughs. WeThe record-setting $240 million property sale was not removed from the dataset, but future analysis should investigate the impacts that this single listing has on modeling results.

Finally, regression summary statistics indicate that gross_square_feet is a better single predictor of sale_price in some boroughs versus others. For example, the R-squared value was estimated at approximately 0.63 in Manhattan, and 0.59 in Bronx, compared to an estimate of only 0.35 in Queens. These differences in R-squared correspond with the scatterplots generated for each borough; the strength of sale prices versus gross square feet was higher, and the dispersion (spread), was lower for Manhattan and Brooklyn as compared to Queens where the relationship was noticeably weaker because the data was more spread out.