1 Introduction

New York City property prices have skyrocketed over the last decade. In this project, I am trying to understand whether this trend holds across all types of property (multi-unit vs. single family) and across the city as a whole, or whether it is isolated to Brooklyn. One challenge is that my dataset is limited to sales of properties in Brooklyn over the last year, so I may not be able to draw inferences about NYC as a whole, let alone about property types across the US. What I can probably infer is whether properties of a certain type are affected more or less in certain zip codes within Brooklyn. I think this is important to understand for anyone trying to buy a home in the city.

1.1 Research Question

In this project, I try to answer the following questions based on the data that is available to me:

  • Do single family condos and single family homes in Brooklyn, NY sell at a similar average price?

  • If you are a family trying to purchase a home in Brooklyn, or anywhere else in the city, are there specific property types that you should target for affordability?

  • Is there a clear correlation between square footage and the price of a property? Is it consistent for all four types of properties collected here?

    • Single Family
    • Condos
    • Multi-Family with no commercial units
    • Multi-Family with commercial units

I will try to answer the questions above and make inferences from the data.

2 Data

2.1 Data collection

The NYC Department of Finance publishes properties sold within the last twelve months on its site. The data include several variables that I was interested in analyzing, for example:

  • Neighborhood
  • Building Type
  • Square Footage
  • Zip Code

among several other variables.

I pulled an XLS file from the NYC.gov site and then converted it to a CSV file. I uploaded the converted file to GitHub to ensure that I could download the data from a stable and reliable repository.

The NYC Department of Finance publishes data for all the boroughs, but I chose to focus solely on Brooklyn, NY for this analysis.

Data Filters:

Because of the type of properties I am interested in, I filtered the data on the following conditions:

  • Multi-family houses of up to 3 units
  • Condos
  • Condos or houses with a gross square footage of at most 4,000
  • Properties with a sale price of at least $3,500. Sales recorded below this value were removed because they are most likely transfers made as gifts rather than market sales.

Properties over 4,000 sq ft were removed because they are special cases that the average person will not be able to afford (they are affordable to roughly 1% of the population).

# load data
library(httr)
library(RCurl)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
csv_file = "https://raw.githubusercontent.com/dapolloxp/R-Projects/master/rollingsales_brooklyn_2019.csv"
out_file <- getURL(csv_file )
property.data <- read.csv(text = out_file, stringsAsFactors = FALSE)

reduced_result <- property.data %>% select("NEIGHBORHOOD","BUILDING.CLASS.CATEGORY", "YEAR.BUILT","ADDRESS", "APARTMENT.NUMBER","ZIP.CODE","RESIDENTIAL.UNITS", "COMMERCIAL.UNITS", "TOTAL.UNITS", "LAND.SQUARE.FEET", "GROSS.SQUARE.FEET", "SALE.PRICE", "SALE.DATE")

## Convert sale price to numeric (strips "$", "," and any other non-digit characters)
reduced_result[["SALE.PRICE"]] <- as.numeric(gsub("[^0-9]", "", reduced_result[["SALE.PRICE"]]))
## Convert sale date from character to Date
reduced_result[["SALE.DATE"]] <- as.Date(reduced_result[["SALE.DATE"]], "%m/%d/%y")
## Treat zip code as a categorical variable
reduced_result[["ZIP.CODE"]] <- as.factor(as.character(reduced_result[["ZIP.CODE"]]))
## Convert land and gross square footage to numeric
reduced_result[["LAND.SQUARE.FEET"]] <- as.numeric(gsub("[^0-9]", "", reduced_result[["LAND.SQUARE.FEET"]]))
reduced_result[["GROSS.SQUARE.FEET"]] <- as.numeric(gsub("[^0-9]", "", reduced_result[["GROSS.SQUARE.FEET"]]))
## Convert unit counts to integer and address to character
reduced_result[["TOTAL.UNITS"]] <- as.integer(reduced_result[["TOTAL.UNITS"]])
reduced_result[["RESIDENTIAL.UNITS"]] <- as.integer(reduced_result[["RESIDENTIAL.UNITS"]])
reduced_result[["ADDRESS"]] <- as.character(reduced_result[["ADDRESS"]])

## Remove any entries from 2017 or earlier
sales_2018_2019 <- subset(reduced_result, SALE.DATE > as.Date("2017-12-31"))
## Since I want to do computation on price, keep only sales with a recorded price
## of at least $3,500 and a gross square footage between 1 and 4,000
recorded_sales_2018_2019 <- sales_2018_2019 %>% filter(!is.na(SALE.PRICE) & GROSS.SQUARE.FEET <= 4000 & GROSS.SQUARE.FEET > 0 & SALE.PRICE >= 3500)
## Summary of sale prices:
## the median price is $790,800 for Brooklyn and the average sale price for 2018 was $1,356,196.
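To make the cleaning step concrete, here is a minimal sketch of what stripping non-digit characters and then filtering on price does. The price strings are made up for illustration and are not taken from the dataset; only the regular expression and the $3,500 threshold come from the code above.

```r
# Hypothetical raw strings, for illustration only (not values from the dataset)
raw_price <- c("$1,250,000", " - ", "0", "995000.50")

# Strip every non-digit character, then convert to numeric
digits_only <- gsub("[^0-9]", "", raw_price)
clean_price <- as.numeric(digits_only)

# Same price rule as the filter above: recorded and at least $3,500
keep <- !is.na(clean_price) & clean_price >= 3500

data.frame(raw_price, clean_price, keep)
```

Two side effects are worth knowing about: a placeholder such as " - " becomes an empty string and therefore NA, and a decimal point is stripped along with everything else, so a price written with cents would be inflated by a factor of 100.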

2.2 Cases

The cases are the individual property sales in Kings County (Brooklyn, NY). In this dataset there are 22,924 observations. However, for the data that I am looking to assess, I must filter out all properties that are not single family or multi-family properties of up to 3 families. In addition, although this could potentially impact my analysis, I also removed several transfers of properties where the recorded value was 0. Per the NYC website, these values indicate that the property transfers were done as gifts. I also separated properties with commercial units from those with no commercial units, as I believe that this will impact any regression techniques that I use in my inference section.

multi.fam.no.comm <- recorded_sales_2018_2019[recorded_sales_2018_2019$RESIDENTIAL.UNITS < 4 & recorded_sales_2018_2019$COMMERCIAL.UNITS == 0,]
multi.fam.with.comm <- recorded_sales_2018_2019[recorded_sales_2018_2019$RESIDENTIAL.UNITS < 4 & recorded_sales_2018_2019$COMMERCIAL.UNITS > 0,]
single.fam.house <- recorded_sales_2018_2019[recorded_sales_2018_2019$RESIDENTIAL.UNITS  == 1 & recorded_sales_2018_2019$COMMERCIAL.UNITS == 0 & recorded_sales_2018_2019$BUILDING.CLASS.CATEGORY == '01 ONE FAMILY DWELLINGS',]
condos <- recorded_sales_2018_2019[str_detect(recorded_sales_2018_2019$BUILDING.CLASS.CATEGORY, "CONDOS"),] %>% filter(RESIDENTIAL.UNITS == 1)
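One thing to keep in mind when reading the subsets above: the multi-family rule is `RESIDENTIAL.UNITS < 4`, which also admits records with 0 or 1 residential units, so the groups can overlap. The sketch below uses three made-up records to show this; `grepl()` stands in for `str_detect()` so that it runs in base R, and `strict_multi` is a hypothetical alternative rule, not the one used in this analysis.

```r
# Hypothetical records, for illustration only
toy <- data.frame(
  BUILDING.CLASS.CATEGORY = c("01 ONE FAMILY DWELLINGS",
                              "02 TWO FAMILY DWELLINGS",
                              "13 CONDOS - ELEVATOR APARTMENTS"),
  RESIDENTIAL.UNITS = c(1, 2, 1),
  COMMERCIAL.UNITS  = c(0, 0, 0),
  stringsAsFactors  = FALSE
)

# Same logical rules as the subsets above
in_multi_no_comm <- toy$RESIDENTIAL.UNITS < 4 & toy$COMMERCIAL.UNITS == 0
in_single_fam <- toy$RESIDENTIAL.UNITS == 1 & toy$COMMERCIAL.UNITS == 0 &
  toy$BUILDING.CLASS.CATEGORY == "01 ONE FAMILY DWELLINGS"
in_condos <- grepl("CONDOS", toy$BUILDING.CLASS.CATEGORY) &
  toy$RESIDENTIAL.UNITS == 1

# Records that land in the multi-family group and in another group as well
overlap <- in_multi_no_comm & (in_single_fam | in_condos)

# A stricter rule (an assumption, not the author's filter): 2 or 3 units only
strict_multi <- toy$RESIDENTIAL.UNITS %in% 2:3 & toy$COMMERCIAL.UNITS == 0

data.frame(toy$BUILDING.CLASS.CATEGORY, in_multi_no_comm, overlap, strict_multi)
```

This is consistent with the summary output in Section 3.1, where the median number of residential units in the multi-family group without commercial units is 1.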

2.3 Variables

2.3.1 Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is the sales price and it happens to be quantitative in this case.

2.3.2 Independent Variable

You should have two independent variables, one quantitative and one qualitative.

The independent variables are square footage (quantitative; the regressions below use GROSS.SQUARE.FEET) and neighborhood (qualitative). I can also use zip code in place of neighborhood.

2.4 Type of study

For my Brooklyn, NY property analysis, the data come from an observational study, since all of the sales are past recorded events. There are no control groups or splitting of data.

2.5 Scope of inference - generalizability

Identify the population of interest, and whether the findings from this analysis can be generalized to that population, or, if not, a subsection of that population. Explain why or why not. Also discuss any potential sources of bias that might prevent generalizability.

For the population of interest, I am looking to assess all Brooklyn homes that consist of single family houses or condos. I am also including multi-family homes with and without commercial units. The findings of this analysis cannot be generalized to the entire population, in this case, all homes in the US. They could potentially be used to assess all homes in NYC, but this is doubtful due to the lack of data from the other boroughs. In addition, NYC is known to be expensive compared to the rest of the United States, and this known fact prevents the conclusions in this study from being generalized.

2.6 Scope of inference - causality

Can these data be used to establish causal links between the variables of interest? Explain why or why not.

The data included in this study cannot be used to establish causal links, because this is an observational study with no random assignment; it can only show associations. For example, it is to be expected that square footage should correlate with the price of a home; however, we question whether the following is true:

  • Is the relationship linear for the four categories assessed?

3 Exploratory Data Analysis

3.1 Relevant summary statistics

Provide summary statistics for each the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

The summary function is used below. For some of the variables, it doesn’t make sense to use the summary statistics, but I have included them for completeness.

For the boxplot, I removed outliers as they would skew the results significantly.

library(lattice)
library(randomcoloR)

summary(condos)
##  NEIGHBORHOOD       BUILDING.CLASS.CATEGORY   YEAR.BUILT  
##  Length:3406        Length:3406             Min.   :   0  
##  Class :character   Class :character        1st Qu.:1910  
##  Mode  :character   Mode  :character        Median :2007  
##                                             Mean   :1617  
##                                             3rd Qu.:2015  
##                                             Max.   :2018  
##                                                           
##    ADDRESS          APARTMENT.NUMBER      ZIP.CODE    RESIDENTIAL.UNITS
##  Length:3406        Length:3406        11201  : 302   Min.   :1        
##  Class :character   Class :character   11215  : 249   1st Qu.:1        
##  Mode  :character   Mode  :character   11217  : 220   Median :1        
##                                        11238  : 220   Mean   :1        
##                                        11249  : 201   3rd Qu.:1        
##                                        11211  : 195   Max.   :1        
##                                        (Other):2019                    
##  COMMERCIAL.UNITS   TOTAL.UNITS     LAND.SQUARE.FEET GROSS.SQUARE.FEET
##  Min.   :-1.0000   Min.   : 1.000   Min.   :     0   Min.   : 349     
##  1st Qu.: 0.0000   1st Qu.: 1.000   1st Qu.:     0   1st Qu.: 736     
##  Median : 0.0000   Median : 1.000   Median :     0   Median :1011     
##  Mean   : 0.2284   Mean   : 1.617   Mean   :  6906   Mean   :1081     
##  3rd Qu.: 0.0000   3rd Qu.: 1.000   3rd Qu.:  5000   3rd Qu.:1271     
##  Max.   :22.0000   Max.   :57.000   Max.   :146587   Max.   :4000     
##                                     NA's   :5                         
##    SALE.PRICE        SALE.DATE         
##  Min.   :   9300   Min.   :2018-04-02  
##  1st Qu.: 595676   1st Qu.:2018-06-19  
##  Median : 814600   Median :2018-09-07  
##  Mean   :1014677   Mean   :2018-09-14  
##  3rd Qu.:1239750   3rd Qu.:2018-12-05  
##  Max.   :8128350   Max.   :2019-03-29  
## 
condos$group <- "Condo"
summary(single.fam.house)
##  NEIGHBORHOOD       BUILDING.CLASS.CATEGORY   YEAR.BUILT  
##  Length:1919        Length:1919             Min.   :   0  
##  Class :character   Class :character        1st Qu.:1920  
##  Mode  :character   Mode  :character        Median :1925  
##                                             Mean   :1928  
##                                             3rd Qu.:1935  
##                                             Max.   :2017  
##                                                           
##    ADDRESS          APARTMENT.NUMBER      ZIP.CODE   RESIDENTIAL.UNITS
##  Length:1919        Length:1919        11234  :347   Min.   :1        
##  Class :character   Class :character   11229  :194   1st Qu.:1        
##  Mode  :character   Mode  :character   11210  :149   Median :1        
##                                        11203  :128   Mean   :1        
##                                        11236  :114   3rd Qu.:1        
##                                        11209  : 88   Max.   :1        
##                                        (Other):899                    
##  COMMERCIAL.UNITS  TOTAL.UNITS LAND.SQUARE.FEET GROSS.SQUARE.FEET
##  Min.   :0        Min.   :1    Min.   :  345    Min.   : 308     
##  1st Qu.:0        1st Qu.:1    1st Qu.: 1800    1st Qu.:1178     
##  Median :0        Median :1    Median : 2000    Median :1408     
##  Mean   :0        Mean   :1    Mean   : 2357    Mean   :1559     
##  3rd Qu.:0        3rd Qu.:1    3rd Qu.: 2596    3rd Qu.:1800     
##  Max.   :0        Max.   :1    Max.   :10744    Max.   :3963     
##                                                                  
##    SALE.PRICE        SALE.DATE         
##  Min.   :   3500   Min.   :2018-04-02  
##  1st Qu.: 515000   1st Qu.:2018-06-28  
##  Median : 690000   Median :2018-09-20  
##  Mean   : 881913   Mean   :2018-09-23  
##  3rd Qu.: 995000   3rd Qu.:2018-12-17  
##  Max.   :8325000   Max.   :2019-03-29  
## 
single.fam.house$group <- "Single Family"
summary(multi.fam.no.comm)
##  NEIGHBORHOOD       BUILDING.CLASS.CATEGORY   YEAR.BUILT  
##  Length:9328        Length:9328             Min.   :   0  
##  Class :character   Class :character        1st Qu.:1910  
##  Mode  :character   Mode  :character        Median :1925  
##                                             Mean   :1812  
##                                             3rd Qu.:1997  
##                                             Max.   :2018  
##                                                           
##    ADDRESS          APARTMENT.NUMBER      ZIP.CODE    RESIDENTIAL.UNITS
##  Length:9328        Length:9328        11234  : 604   Min.   :0.000    
##  Class :character   Class :character   11236  : 475   1st Qu.:1.000    
##  Mode  :character   Mode  :character   11229  : 466   Median :1.000    
##                                        11215  : 408   Mean   :1.541    
##                                        11208  : 400   3rd Qu.:2.000    
##                                        11207  : 395   Max.   :3.000    
##                                        (Other):6580                    
##  COMMERCIAL.UNITS  TOTAL.UNITS     LAND.SQUARE.FEET GROSS.SQUARE.FEET
##  Min.   :0        Min.   : 0.000   Min.   :     0   Min.   : 200     
##  1st Qu.:0        1st Qu.: 1.000   1st Qu.:  1500   1st Qu.:1090     
##  Median :0        Median : 1.000   Median :  2000   Median :1573     
##  Mean   :0        Mean   : 1.555   Mean   :  3948   Mean   :1719     
##  3rd Qu.:0        3rd Qu.: 2.000   3rd Qu.:  2620   3rd Qu.:2256     
##  Max.   :0        Max.   :21.000   Max.   :146587   Max.   :4000     
##                                    NA's   :5                         
##    SALE.PRICE        SALE.DATE         
##  Min.   :   3500   Min.   :2018-04-02  
##  1st Qu.: 585587   1st Qu.:2018-06-26  
##  Median : 835000   Median :2018-09-17  
##  Mean   :1019442   Mean   :2018-09-21  
##  3rd Qu.:1250000   3rd Qu.:2018-12-14  
##  Max.   :8325000   Max.   :2019-03-29  
## 
multi.fam.no.comm$group <- "Multi Family No Comm."
summary(multi.fam.with.comm)
##  NEIGHBORHOOD       BUILDING.CLASS.CATEGORY   YEAR.BUILT  
##  Length:675         Length:675              Min.   :   0  
##  Class :character   Class :character        1st Qu.:1922  
##  Mode  :character   Mode  :character        Median :1931  
##                                             Mean   :1874  
##                                             3rd Qu.:1960  
##                                             Max.   :2017  
##                                                           
##    ADDRESS          APARTMENT.NUMBER      ZIP.CODE   RESIDENTIAL.UNITS
##  Length:675         Length:675         11204  : 50   Min.   :0.0000   
##  Class :character   Class :character   11249  : 36   1st Qu.:0.0000   
##  Mode  :character   Mode  :character   11234  : 30   Median :1.0000   
##                                        11208  : 29   Mean   :0.9452   
##                                        11203  : 27   3rd Qu.:2.0000   
##                                        11207  : 27   Max.   :3.0000   
##                                        (Other):476                    
##  COMMERCIAL.UNITS  TOTAL.UNITS     LAND.SQUARE.FEET GROSS.SQUARE.FEET
##  Min.   : 1.000   Min.   : 1.000   Min.   :     0   Min.   : 112     
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.:  1500   1st Qu.:1013     
##  Median : 1.000   Median : 2.000   Median :  2000   Median :2336     
##  Mean   : 2.233   Mean   : 4.948   Mean   :  3658   Mean   :2110     
##  3rd Qu.: 1.000   3rd Qu.: 3.000   3rd Qu.:  2931   3rd Qu.:3130     
##  Max.   :22.000   Max.   :57.000   Max.   :137889   Max.   :4000     
##                                                                      
##    SALE.PRICE         SALE.DATE         
##  Min.   :    4900   Min.   :2018-04-02  
##  1st Qu.:  548836   1st Qu.:2018-06-28  
##  Median :  975000   Median :2018-10-03  
##  Mean   : 1339360   Mean   :2018-10-03  
##  3rd Qu.: 1622500   3rd Qu.:2019-01-02  
##  Max.   :32250000   Max.   :2019-03-29  
## 
multi.fam.with.comm$group <- "Multi Family With Comm."

total <- rbind(condos, single.fam.house, multi.fam.no.comm, multi.fam.with.comm)
condo.and.single.fam <- rbind(condos, single.fam.house)

Box Plot Comparison:

In the box plot below, we can see that the medians are close together, but each of the categories has several outliers, with commercial properties having the most extreme values.

ggplot(total, aes(x = GROSS.SQUARE.FEET, y = SALE.PRICE, color = group)) + geom_boxplot(notch = TRUE) + xlab("Gross Square Feet") + ylab("Sales Price") + labs(color = "Property Type")
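As a rough check on the outlier claim, the quartiles from the summary output above can be plugged into the usual 1.5 × IQR rule, which is the default whisker rule for `geom_boxplot()`. This is only a sketch: the quartiles are copied by hand from the `summary()` tables, and the fence values are approximate because `summary()` and the box plot may compute quartiles slightly differently.

```r
# Quartiles and maxima of SALE.PRICE, copied from the summary() output above
quartiles <- data.frame(
  group = c("Condo", "Single Family",
            "Multi Family No Comm.", "Multi Family With Comm."),
  q1  = c(595676, 515000, 585587, 548836),
  q3  = c(1239750, 995000, 1250000, 1622500),
  max = c(8128350, 8325000, 8325000, 32250000)
)

# Upper fence: anything above Q3 + 1.5 * IQR is drawn as an outlier
quartiles$iqr <- quartiles$q3 - quartiles$q1
quartiles$upper_fence <- quartiles$q3 + 1.5 * quartiles$iqr
quartiles$has_high_outliers <- quartiles$max > quartiles$upper_fence

quartiles
```

In every group the maximum sale price is far above the upper fence, and the gap is largest for multi-family properties with commercial units.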

plot(reduced_result$LAND.SQUARE.FEET, reduced_result$SALE.PRICE, xlab = "Square Feet ", ylab = "Sales Price")

# 39 random colors, indexed by zip code factor level in pairs() below
# (hue and luminosity are left at their defaults)
my_cols <- randomColor(count = 39)
pairs(reduced_result[,7:10], pch= 19, cex = 0.5, col=my_cols[reduced_result$ZIP.CODE], lower.panel = NULL)

4 Inference

Does Gross Square Feet have a direct linear relationship to price in Brooklyn, NY?

Are the average property sale prices significantly different for condos, single family homes, and multi-family homes with and without commercial units?

4.1 Conditions

Below we check the following conditions on the 4 subsets of data:

  • the normality of the data

Based on the graphs below, the data is skewed to the right, but we will proceed with the analysis understanding that these results are skewed.

par(mfrow=c(2,2))
qqnorm(condos$SALE.PRICE, main = "Condo Price QQNorm")
qqline(condos$SALE.PRICE)
qqnorm(condos$GROSS.SQUARE.FEET, main = "Condo SQ Ft QQNorm")
qqline(condos$GROSS.SQUARE.FEET)
hist(condos$SALE.PRICE, xlab = "Sales Price", ylab="Frequency", main = "Frequency Plot")
hist(condos$GROSS.SQUARE.FEET,xlab = "Gross Square Feet", ylab="Frequency",main = "Frequency Plot")

par(mfrow=c(2,2))
qqnorm(single.fam.house$SALE.PRICE, main = "Single Fam. Price QQNorm")
qqline(single.fam.house$SALE.PRICE)
qqnorm(single.fam.house$GROSS.SQUARE.FEET,  main = "Single Fam. Sq Ft. QQNorm")
qqline(single.fam.house$GROSS.SQUARE.FEET)
hist(single.fam.house$SALE.PRICE, xlab = "Sales Price", ylab="Frequency",main = "Frequency Plot")
hist(single.fam.house$GROSS.SQUARE.FEET,xlab = "Gross Square Feet", ylab="Frequency",main = "Frequency Plot")

par(mfrow=c(2,2))
qqnorm(multi.fam.no.comm$SALE.PRICE, main="Multi Fam. Price QQNorm")
qqline(multi.fam.no.comm$SALE.PRICE)
qqnorm(multi.fam.no.comm$GROSS.SQUARE.FEET, main="Multi Fam. Sq Ft. QQNorm")
qqline(multi.fam.no.comm$GROSS.SQUARE.FEET)
hist(multi.fam.no.comm$SALE.PRICE, xlab = "Sales Price", ylab="Frequency",main = "Frequency Plot")
hist(multi.fam.no.comm$GROSS.SQUARE.FEET, xlab = "Gross Square Feet", ylab="Frequency",main = "Frequency Plot")

par(mfrow=c(2,2))
qqnorm(multi.fam.with.comm$SALE.PRICE, main="Multi Fam. w/c Price QQNorm")
qqline(multi.fam.with.comm$SALE.PRICE)
qqnorm(multi.fam.with.comm$GROSS.SQUARE.FEET, main="Multi Fam. w/c Sq Ft. QQNorm")
qqline(multi.fam.with.comm$GROSS.SQUARE.FEET)
hist(multi.fam.with.comm$SALE.PRICE, xlab = "Sales Price", ylab="Frequency",main = "Frequency Plot")
hist(multi.fam.with.comm$GROSS.SQUARE.FEET,xlab = "Gross Square Feet", ylab="Frequency",main = "Frequency Plot")
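The right skew seen in the plots can also be read off the summary statistics: in a right-skewed distribution the mean is pulled above the median. A small sketch, using the sale price medians and means copied by hand from the `summary()` output in Section 3.1:

```r
# Median and mean SALE.PRICE per group, copied from the summary() output
price_summary <- data.frame(
  group = c("Condo", "Single Family",
            "Multi Family No Comm.", "Multi Family With Comm."),
  median = c(814600, 690000, 835000, 975000),
  mean   = c(1014677, 881913, 1019442, 1339360)
)

# A ratio above 1 means the mean exceeds the median (a sign of right skew)
price_summary$mean_over_median <-
  round(price_summary$mean / price_summary$median, 2)

price_summary
```

The mean exceeds the median in all four groups, which agrees with the histograms and Q-Q plots.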

4.2 Theoretical Inference

In this section, I perform the following tests on 4 subsets of data

  • Linear Regression and Diagnostic Plots
  • Nearly Normal Condition

4.3 Difference of means

In this section, I am interested in determining if there is a significant difference of means in sales price between condos and single family homes:

model1 <- aov(condo.and.single.fam$SALE.PRICE ~ condo.and.single.fam$group)
summary(model1)
##                              Df    Sum Sq   Mean Sq F value   Pr(>F)    
## condo.and.single.fam$group    1 2.164e+13 2.164e+13   43.01 5.95e-11 ***
## Residuals                  5323 2.677e+15 5.030e+11                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the F value and p-value above, we can see that the difference in means is statistically significant. However, we expect this to some extent. I also want to run an ANOVA test for two neighborhoods with the same property type:

Condos

  • Windsor Terrace
  • Kensington

# wt = Windsor Terrace, ken = Kensington
wt <- condos %>% filter(NEIGHBORHOOD == "WINDSOR TERRACE")
ken <- condos %>% filter(NEIGHBORHOOD == "KENSINGTON")
both <- rbind(wt, ken)
model2 <- aov(both$SALE.PRICE ~ both$NEIGHBORHOOD)
summary(model2)
##                   Df    Sum Sq   Mean Sq F value Pr(>F)
## both$NEIGHBORHOOD  1 3.459e+09 3.459e+09   0.052  0.821
## Residuals         33 2.197e+12 6.659e+10

Based on the p-value and F value, we can see that there is no statistically significant difference between the average sale prices of condos in Kensington and Windsor Terrace.

Single Family Homes

  • Windsor Terrace
  • Kensington

# wt = Windsor Terrace, ken = Kensington
wt <- single.fam.house %>% filter(NEIGHBORHOOD == "WINDSOR TERRACE")
ken <- single.fam.house %>% filter(NEIGHBORHOOD == "KENSINGTON")
both <- rbind(wt, ken)
model2 <- aov(both$SALE.PRICE ~ both$NEIGHBORHOOD)
summary(model2)
##                   Df   Sum Sq   Mean Sq F value Pr(>F)
## both$NEIGHBORHOOD  1 2.19e+09 2.190e+09   0.021  0.888
## Residuals         16 1.70e+12 1.062e+11

Based on the p-value and F value, we can see that there is no statistically significant difference between the average sale prices of single family homes in Kensington and Windsor Terrace.
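To show where the F values and p-values in the three ANOVA tables come from, the sketch below recomputes them from the printed mean squares and degrees of freedom. The inputs are copied by hand from the output above, so the results only agree with the tables up to rounding.

```r
# Mean squares and residual degrees of freedom from the three ANOVA tables
anova_tables <- data.frame(
  comparison = c("Condos vs. single family homes",
                 "Condos: Windsor Terrace vs. Kensington",
                 "Single family: Windsor Terrace vs. Kensington"),
  ms_group = c(2.164e13, 3.459e9, 2.190e9),
  ms_resid = c(5.030e11, 6.659e10, 1.062e11),
  df_resid = c(5323, 33, 16)
)

# F is the ratio of between-group to within-group mean squares
anova_tables$f_value <- anova_tables$ms_group / anova_tables$ms_resid

# The p-value is the upper tail of the F distribution with (1, df_resid) df
anova_tables$p_value <- pf(anova_tables$f_value, df1 = 1,
                           df2 = anova_tables$df_resid, lower.tail = FALSE)

anova_tables[, c("comparison", "f_value", "p_value")]
```

A large F (first row) means the group means differ by much more than the spread within groups would explain; an F near zero (second and third rows) means the neighborhood means are well within that spread.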

4.4 Linear Regression

I tested this on the following sets of data to determine if linear regression can be used:

  • Condo Sales Price ~ Its Square Footage
  • Single Family Home Price ~ Its Square Footage
  • Multi Family Home Price with No Commercial Space ~ Its Square Footage
  • Multi Family Home Price with Commercial Space ~ Its Square Footage

4.4.1 Condo Data Set

For the Condo dataset, we can see that we have a correlation value of 0.63, which shows that there is a moderate positive relationship between square footage and price.

  • The normal Q-Q plot of the residuals curves sharply upward at the upper end, so this is an item that we need to be mindful of.
cor(condos$SALE.PRICE, condos$GROSS.SQUARE.FEET)
## [1] 0.6261303
condo.lm <- lm(SALE.PRICE ~ GROSS.SQUARE.FEET, data=condos)
summary(condo.lm)
## 
## Call:
## lm(formula = SALE.PRICE ~ GROSS.SQUARE.FEET, data = condos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2750509  -309725   -23464   234904  4473445 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       17714.19   23312.14    0.76    0.447    
## GROSS.SQUARE.FEET   921.97      19.68   46.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 555600 on 3404 degrees of freedom
## Multiple R-squared:  0.392,  Adjusted R-squared:  0.3919 
## F-statistic:  2195 on 1 and 3404 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(condos$SALE.PRICE ~ condos$GROSS.SQUARE.FEET, xlab="Square Feet", ylab="Sale Price", main="Condos Sales Price vs. Sqft")
abline(condo.lm)
# model diagnostics - linearity and nearly normal residuals
plot(condo.lm$residuals ~ condos$SALE.PRICE, xlab="Sale Price", ylab="Residuals", main="Residuals vs. Condos Sales Price")
abline(h = 0, lty = 3)
hist(condo.lm$residuals, xlab="Residuals", main="Condo Residuals")
qqnorm(condo.lm$residuals)
qqline(condo.lm$residuals)
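To make the fitted condo model easier to read, the sketch below turns the printed coefficients into a small prediction function. The helper name `predict_condo_price` is my own, and the coefficients, mean square footage, and mean sale price are copied by hand from the output above, so results are approximate.

```r
# Coefficients copied from summary(condo.lm)
condo_intercept <- 17714.19
condo_slope     <- 921.97   # dollars per additional gross square foot

# Hypothetical helper: fitted sale price for a given gross square footage
predict_condo_price <- function(sq_ft) {
  condo_intercept + condo_slope * sq_ft
}

# A least-squares line passes through the point of means, so the fitted
# price at the mean square footage (1081) should be close to the mean
# sale price (1,014,677) from summary(condos)
fitted_at_mean <- predict_condo_price(1081)
gap_to_mean_price <- fitted_at_mean - 1014677

c(fitted_at_mean = fitted_at_mean, gap_to_mean_price = gap_to_mean_price)
```

The small gap comes from rounding in the printed summary. Keep in mind that the residual standard error is about $555,600, so any single prediction from this line carries a wide margin.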

# ------

4.4.2 Single Family Data Set

For the Single Family dataset, we can see that we have a correlation value of 0.58, which shows that there is a moderate positive relationship between square footage and price, but a weaker one than for condos.

  • The normal Q-Q plot of the residuals curves sharply upward at the upper end, so this is an item that we need to be mindful of.
cor(single.fam.house$SALE.PRICE, single.fam.house$GROSS.SQUARE.FEET)
## [1] 0.5816722
single.fam.lm <- lm(SALE.PRICE ~ GROSS.SQUARE.FEET, data=single.fam.house)
summary(single.fam.lm)
## 
## Call:
## lm(formula = SALE.PRICE ~ GROSS.SQUARE.FEET, data = single.fam.house)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2030732  -270503   -54623   172477  6545986 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -209852.02   37236.93  -5.636    2e-08 ***
## GROSS.SQUARE.FEET     700.30      22.37  31.309   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 572300 on 1917 degrees of freedom
## Multiple R-squared:  0.3383, Adjusted R-squared:  0.338 
## F-statistic: 980.3 on 1 and 1917 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(single.fam.house$SALE.PRICE ~ single.fam.house$GROSS.SQUARE.FEET,xlab="Square Feet", ylab="Sale Price", main="Single Family Sales Price vs. Sqft")
abline(single.fam.lm)
# model diagnostics - linearity and nearly normal residuals
plot(single.fam.lm$residuals ~ single.fam.house$SALE.PRICE,xlab="Sale Price", ylab="Residuals", main="Residuals vs. Single Family Sales Price")
abline(h = 0, lty = 3)
hist(single.fam.lm$residuals, breaks=30,xlab="Residuals", main="Single Family Residuals")
qqnorm(single.fam.lm$residuals)
qqline(single.fam.lm$residuals)
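The negative intercept in the single family model deserves a quick sanity check, since it implies a negative fitted price for very small homes. A minimal sketch, using the coefficients and the minimum gross square footage (308) copied from the output above:

```r
# Coefficients copied from summary(single.fam.lm)
sf_intercept <- -209852.02
sf_slope     <- 700.30

# Square footage at which the fitted line crosses a price of zero
zero_crossing <- -sf_intercept / sf_slope

# Fitted price at the smallest home in the data (308 gross sq ft)
fitted_at_min <- sf_intercept + sf_slope * 308

c(zero_crossing = zero_crossing, fitted_at_min = fitted_at_min)
```

The line crosses zero just below the smallest observed home, so fitted prices stay positive within the data, but they are unrealistically low at the small end. This is another hint that a straight line is a rough fit here and should not be extrapolated.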

# ------

4.4.3 Multi Family with no commercial

For the Multi-Family dataset with no commercial units, we can see that we have a correlation value of 0.37, which shows that there is some positive relationship between square footage and price, but not as strong as for condos or single family homes.

  • The normal Q-Q plot of the residuals curves sharply upward at the upper end, so this is an item that we need to be mindful of.
cor(multi.fam.no.comm$SALE.PRICE, multi.fam.no.comm$GROSS.SQUARE.FEET)
## [1] 0.3655884
multi.fam.no.comm.lm <- lm(SALE.PRICE ~ GROSS.SQUARE.FEET, data=multi.fam.no.comm)
summary(multi.fam.no.comm.lm)
## 
## Call:
## lm(formula = SALE.PRICE ~ GROSS.SQUARE.FEET, data = multi.fam.no.comm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1751297  -379759  -121228   207819  6936088 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.527e+05  1.649e+04   27.45   <2e-16 ***
## GROSS.SQUARE.FEET 3.296e+02  8.691e+00   37.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 674600 on 9326 degrees of freedom
## Multiple R-squared:  0.1337, Adjusted R-squared:  0.1336 
## F-statistic:  1439 on 1 and 9326 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(multi.fam.no.comm$SALE.PRICE ~ multi.fam.no.comm$GROSS.SQUARE.FEET,xlab="Square Feet", ylab="Sale Price", main="Multi Fam. w/o c.  Sales Price vs. Sqft")
abline(multi.fam.no.comm.lm)
# model diagnostics - linearity and nearly normal residuals
plot(multi.fam.no.comm.lm$residuals ~ multi.fam.no.comm$SALE.PRICE,xlab="Sale Price", ylab="Residuals", main="Residuals vs. Multi Fam. w/o c. Sales Price")
abline(h = 0, lty = 3)
hist(multi.fam.no.comm.lm$residuals, breaks=30,xlab="Residuals", main="Multi Fam. w/o c. Residuals")
qqnorm(multi.fam.no.comm.lm$residuals)
qqline(multi.fam.no.comm.lm$residuals)

# ------

4.4.4 Multi Family with commercial

For the Multi-Family dataset with commercial units, we can see that we have a correlation value of 0.26, which shows that there is some positive relationship between square footage and price, but an even weaker one than for the other three datasets.

  • The normal Q-Q plot of the residuals curves sharply upward at the upper end, so this is an item that we need to be mindful of.
cor(multi.fam.with.comm$SALE.PRICE, multi.fam.with.comm$GROSS.SQUARE.FEET)
## [1] 0.2640267
multi.fam.with.comm.lm <- lm(SALE.PRICE ~ GROSS.SQUARE.FEET, data=multi.fam.with.comm)
summary(multi.fam.with.comm.lm)
## 
## Call:
## lm(formula = SALE.PRICE ~ GROSS.SQUARE.FEET, data = multi.fam.with.comm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2065055  -615608  -377012   182475 31080472 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       465939.43  142541.78   3.269  0.00114 ** 
## GROSS.SQUARE.FEET    413.88      58.28   7.101 3.15e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1872000 on 673 degrees of freedom
## Multiple R-squared:  0.06971,    Adjusted R-squared:  0.06833 
## F-statistic: 50.43 on 1 and 673 DF,  p-value: 3.148e-12
par(mfrow=c(2,2))
plot(multi.fam.with.comm$SALE.PRICE ~ multi.fam.with.comm$GROSS.SQUARE.FEET,xlab="Square Feet", ylab="Sale Price", main="Multi Fam. w c.  Sales Price vs. Sqft")
abline(multi.fam.with.comm.lm)
# model diagnostics - linearity and nearly normal residuals
plot(multi.fam.with.comm.lm$residuals ~ multi.fam.with.comm$SALE.PRICE,xlab="Sale Price", ylab="Residuals", main="Residuals vs. Multi Fam. w. c. Sales Price")
abline(h = 0, lty = 3)
hist(multi.fam.with.comm.lm$residuals, breaks=30,xlab="Residuals", main="Multi Fam. w. c. Residuals")
qqnorm(multi.fam.with.comm.lm$residuals)
qqline(multi.fam.with.comm.lm$residuals)
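The four fits are easier to compare side by side. The sketch below collects the correlations, slopes, and reported R-squared values from the output above (all copied by hand) and checks that, for a regression with a single predictor, R-squared is simply the squared correlation.

```r
# Values copied from the cor() and summary(lm) output of the four models
fits <- data.frame(
  group = c("Condo", "Single Family",
            "Multi Family No Comm.", "Multi Family With Comm."),
  r                  = c(0.6261303, 0.5816722, 0.3655884, 0.2640267),
  slope_per_sq_ft    = c(921.97, 700.30, 329.6, 413.88),
  r_squared_reported = c(0.392, 0.3383, 0.1337, 0.06971)
)

# With one predictor, R-squared equals the squared correlation
fits$r_squared <- fits$r^2

# Largest discrepancy between computed and reported R-squared (rounding only)
max_gap <- max(abs(fits$r_squared - fits$r_squared_reported))

fits[order(-fits$r_squared), ]
```

Read this way, square footage explains about 39% of the variation in condo prices but only about 7% for multi-family properties with commercial units, and the price per additional square foot differs widely across the four groups.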

# ------

5 Conclusion

We originally set out to answer the following questions:

  • Do single family condos and single family homes in Brooklyn, NY sell at a similar average price?

Based on our ANOVA test, we determined that the difference in mean sale price between condos and single family homes is statistically significant (F = 43.01, p = 5.95e-11). We expected this due to the different types of properties. However, I wanted to check the difference of means between two neighborhoods as well.

In addition to the test above, we therefore compared two neighborhoods for both types of properties:

  • Windsor Terrace
  • Kensington

For both ANOVA tests, the p-value was well above 0.05 (0.821 for condos and 0.888 for single family homes), which indicates that the difference between the neighborhood averages is not statistically significant.

  • If you are a family trying to purchase a home in Brooklyn, or anywhere else in the city, are there specific property types that you should target for affordability?

While not statistical in nature, based on the data gathered, we can see that certain types of properties sell for much more in Brooklyn. In increasing order of mean sale price:

  • Single Family
  • Condos
  • Multi Families - No Commercial
  • Multi Families - With Commercial

  • Is there a clear correlation between square footage and the price of a property? Is it consistent for all four types of properties collected here?

For all four regressions, the correlation is positive (ranging from 0.26 to 0.63). However, the relationship is not linear and curves upward exponentially toward the higher sale prices. While there is a positive correlation for each of the property types, we cannot use the linear regression model to reliably predict the value of properties at the higher end, as the data are skewed to the right.

  • Single Family
  • Condos
  • Multi-Family with no commercial units
  • Multi-Family with commercial units

Final Thoughts and Lessons Learned:

Building predictive models is extremely complex. Even when data appear to be nearly normal, I found that the minute you run into non-linear data, trying to fit a model can be very challenging, and it takes time and practice to become an expert. It is impossible to find a perfect model that will fit the data. I found that I can reshape data to be closer to a model, but doing so tampers with the information and can invalidate any findings. This illustrates why many people feel that you can easily lie with statistics.