New York City property prices have skyrocketed over the last decade. As part of this project, I am trying to understand whether this trend holds across all types of property (multi-unit vs. single family) in the city as a whole, or whether it is isolated to Brooklyn. One of the challenges I will face with this project is that my dataset is limited to sales of properties in Brooklyn over the last year, so I may not be able to draw inferences about NYC as a whole, let alone about property types across the US. What I can probably infer is whether properties of a certain type are affected more or less in certain zip codes within Brooklyn. I think this is important to understand for several reasons:
In this project, I try to answer the following questions based on the data that is available to me:
Do single family condos and single family homes in Brooklyn, NY sell at a similar average price?
If you are a family trying to purchase a home in NYC, are there specific property types that you should target for affordability?
Is there a clear correlation between square footage and the price of a property? Is it consistent for all three types of properties collected here?
I will try to answer the questions above and make inferences from the data.
The NYC Department of Finance publishes properties sold within the last twelve months on its site. The data includes several variables that I was interested in analyzing. For example, some of the data available includes the following:
And several other variables.
I pulled an XLS file from the NYC.gov site and then converted it to a CSV file. I uploaded the converted file to GitHub to ensure that I could download the data from a stable and reliable repository.
The NYC Department of Finance publishes data for all the boroughs, but I chose to focus solely on Brooklyn, NY for this analysis.
Data Filters:
Because of the type of properties I am interested in, I chose to filter properties on the following conditions:
I filter out properties over 5000 sq ft because these properties have special conditions that put them out of reach of the average person: only about 1% of the population can afford them. Note that the filter in the code below actually uses a cutoff of 4000 gross square feet.
# load data
library(httr)
library(RCurl)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
csv_file <- "https://raw.githubusercontent.com/dapolloxp/R-Projects/master/rollingsales_brooklyn_2019.csv"
out_file <- getURL(csv_file)
property.data <- read.csv(text = out_file, stringsAsFactors = FALSE)
reduced_result <- property.data %>% select("NEIGHBORHOOD","BUILDING.CLASS.CATEGORY", "YEAR.BUILT","ADDRESS", "APARTMENT.NUMBER","ZIP.CODE","RESIDENTIAL.UNITS", "COMMERCIAL.UNITS", "TOTAL.UNITS", "LAND.SQUARE.FEET", "GROSS.SQUARE.FEET", "SALE.PRICE", "SALE.DATE")
## convert sales price to numeric
reduced_result[["SALE.PRICE"]] <- as.numeric((gsub("[^0-9]","", reduced_result[["SALE.PRICE"]])))
## convert sales date from character to date
reduced_result[["SALE.DATE"]] <- as.Date(reduced_result[["SALE.DATE"]], "%m/%d/%y")
x<-as.character(reduced_result[["ZIP.CODE"]])
reduced_result[["ZIP.CODE"]] <- as.factor(x)
##
reduced_result[["LAND.SQUARE.FEET"]] <- as.numeric((gsub("[^0-9]","", reduced_result[["LAND.SQUARE.FEET"]])))
##
reduced_result[["GROSS.SQUARE.FEET"]] <- as.numeric((gsub("[^0-9]","", reduced_result[["GROSS.SQUARE.FEET"]])))
reduced_result[["TOTAL.UNITS"]] <- as.integer(reduced_result[["TOTAL.UNITS"]])
reduced_result[["RESIDENTIAL.UNITS"]] <- as.integer(reduced_result[["RESIDENTIAL.UNITS"]])
reduced_result[["ADDRESS"]] <- as.character(reduced_result[["ADDRESS"]])
##leaflet() %>% addTiles() %>% setView(-73.994, 40.6782, zoom = 11)
### Removing any entries that are for 2017
sales_2018_2019 <- subset(reduced_result, SALE.DATE > as.Date("2017-12-31"))
#reduced_result
## Since I want to do computation on price, I am filtering any values that do not contain price sales
#recorded_sales_2018_2019 <- sales_2018_2019 %>% filter(SALE.PRICE != "NA")
#recorded_sales_2018_2019 <- sales_2018_2019 %>% filter(SALE.PRICE != "NA" | SALE.PRICE == 0)
recorded_sales_2018_2019 <- sales_2018_2019 %>% filter(!is.na(SALE.PRICE) & GROSS.SQUARE.FEET <= 4000 & GROSS.SQUARE.FEET > 0 & SALE.PRICE >= 3500)
## Below is a summary of sale prices
## The median price is $790,800 for Brooklyn and the average sale price for 2018 was $1,356,196.
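The type conversions in the chunk above can be checked on a few made-up raw values. This is a minimal sketch: the strings in `raw_price` and `raw_date` are assumptions about what the raw file might contain, not values taken from the dataset.

```r
# Made-up raw values in the shape the conversions above expect.
raw_price <- c("$1,250,000", " - ", "0", "795,000")
raw_date  <- c("4/2/18", "12/31/17")

# Stripping every non-digit leaves "" for placeholder entries,
# which as.numeric() turns into NA.
price <- as.numeric(gsub("[^0-9]", "", raw_price))
price

# %y reads a two-digit year; a four-digit year would need %Y instead.
sale_date <- as.Date(raw_date, "%m/%d/%y")
sale_date

# Comparing against an explicit Date avoids relying on string coercion.
sale_date > as.Date("2017-12-31")
```

Note that the pattern `[^0-9]` also removes decimal points, so a value such as "1,200.50" would become 120050. This only matters if the raw file records cents.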
The cases are the individual property sales in Kings County (Brooklyn, NY). In this dataset there are 22,924 observations. However, for the data that I am looking to assess, I must filter out all properties that are not single family or multi-family properties of up to 3 families. In addition, although this could potentially impact my analysis, I also removed several transfers of properties where the recorded value was 0; the filter above in fact drops every sale recorded below $3,500. Per the NYC website, a value of 0 indicates that the property transfer was done as a gift. I also separated properties with commercial units from those without, as I believe that this will impact any regression techniques that I leverage in my inference section.
multi.fam.no.comm <- recorded_sales_2018_2019[recorded_sales_2018_2019$RESIDENTIAL.UNITS < 4 & recorded_sales_2018_2019$COMMERCIAL.UNITS == 0,]
multi.fam.with.comm <- recorded_sales_2018_2019[recorded_sales_2018_2019$RESIDENTIAL.UNITS < 4 & recorded_sales_2018_2019$COMMERCIAL.UNITS > 0,]
single.fam.house <- recorded_sales_2018_2019[recorded_sales_2018_2019$RESIDENTIAL.UNITS == 1 & recorded_sales_2018_2019$COMMERCIAL.UNITS == 0 & recorded_sales_2018_2019$BUILDING.CLASS.CATEGORY == '01 ONE FAMILY DWELLINGS',]
condos <- recorded_sales_2018_2019[str_detect(recorded_sales_2018_2019$BUILDING.CLASS.CATEGORY, "CONDOS"),] %>% filter(RESIDENTIAL.UNITS == 1)
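The bracket-based splits above behave differently from `subset()` or `filter()` when a condition evaluates to `NA`. A minimal sketch on a toy data frame (all values are made up; whether this matters for the real data depends on whether the unit columns contain missing values):

```r
# Toy data standing in for recorded_sales_2018_2019 (values are made up).
toy <- data.frame(
  RESIDENTIAL.UNITS = c(1, 2, 3, 5, NA, 1),
  COMMERCIAL.UNITS  = c(0, 0, 1, 0, 0, 2),
  SALE.PRICE        = c(650000, 900000, 1200000, 3000000, 700000, 800000)
)

# Bracket indexing keeps an all-NA row wherever the condition is NA.
by_bracket <- toy[toy$RESIDENTIAL.UNITS < 4 & toy$COMMERCIAL.UNITS == 0, ]

# subset() drops rows where the condition is NA, so only real matches remain.
by_subset <- subset(toy, RESIDENTIAL.UNITS < 4 & COMMERCIAL.UNITS == 0)

nrow(by_bracket)  # counts the all-NA row as well
nrow(by_subset)
```

If the two row counts differ on the real data, the `subset()` form is probably the one that matches the intent of the split.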
What is the response variable? Is it quantitative or qualitative?
The response variable is the sales price and it happens to be quantitative in this case.
You should have two independent variables, one quantitative and one qualitative.
The independent variables are land.square.feet (quantitative), and neighborhood (qualitative). I can also use zip code in place of neighborhood.
For my Brooklyn, NY property analysis, my data is considered an observational study since all of these are past recorded events. There are no control groups or splitting of data.
Identify the population of interest, and whether the findings from this analysis can be generalized to that population, or, if not, a subsection of that population. Explain why or why not. Also discuss any potential sources of bias that might prevent generalizability.
For the population of interest, I am looking to assess all Brooklyn homes that consist of single family houses or condos. I am also including multi-family homes with and without commercial units. The findings of this analysis cannot be generalized to the entire population, in this case, all homes in the US. They could potentially be used to assess all homes in NYC, but that is doubtful due to the lack of data from the other boroughs. In addition, NYC is known to be expensive compared to the rest of the United States, and this known fact prevents the conclusions in this study from being generalized.
Can these data be used to establish causal links between the variables of interest? Explain why or why not.
Because this is an observational study, the data included here can be used to show associations between variables, but cannot on their own establish causal links. For example, it is to be expected that square footage should correlate with the price of a home; however, we question whether the following are true:
Provide summary statistics for each the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
The summary function is used below. For some of the variables, it doesn’t make sense to use the summary statistics, but I have included them for completeness.
For the boxplot, I removed outliers as they would skew the results significantly.
library(lattice)
library(randomcoloR)
summary(condos)
## NEIGHBORHOOD BUILDING.CLASS.CATEGORY YEAR.BUILT
## Length:3406 Length:3406 Min. : 0
## Class :character Class :character 1st Qu.:1910
## Mode :character Mode :character Median :2007
## Mean :1617
## 3rd Qu.:2015
## Max. :2018
##
## ADDRESS APARTMENT.NUMBER ZIP.CODE RESIDENTIAL.UNITS
## Length:3406 Length:3406 11201 : 302 Min. :1
## Class :character Class :character 11215 : 249 1st Qu.:1
## Mode :character Mode :character 11217 : 220 Median :1
## 11238 : 220 Mean :1
## 11249 : 201 3rd Qu.:1
## 11211 : 195 Max. :1
## (Other):2019
## COMMERCIAL.UNITS TOTAL.UNITS LAND.SQUARE.FEET GROSS.SQUARE.FEET
## Min. :-1.0000 Min. : 1.000 Min. : 0 Min. : 349
## 1st Qu.: 0.0000 1st Qu.: 1.000 1st Qu.: 0 1st Qu.: 736
## Median : 0.0000 Median : 1.000 Median : 0 Median :1011
## Mean : 0.2284 Mean : 1.617 Mean : 6906 Mean :1081
## 3rd Qu.: 0.0000 3rd Qu.: 1.000 3rd Qu.: 5000 3rd Qu.:1271
## Max. :22.0000 Max. :57.000 Max. :146587 Max. :4000
## NA's :5
## SALE.PRICE SALE.DATE
## Min. : 9300 Min. :2018-04-02
## 1st Qu.: 595676 1st Qu.:2018-06-19
## Median : 814600 Median :2018-09-07
## Mean :1014677 Mean :2018-09-14
## 3rd Qu.:1239750 3rd Qu.:2018-12-05
## Max. :8128350 Max. :2019-03-29
##
condos$group <- "Condo"
summary(single.fam.house)
## NEIGHBORHOOD BUILDING.CLASS.CATEGORY YEAR.BUILT
## Length:1919 Length:1919 Min. : 0
## Class :character Class :character 1st Qu.:1920
## Mode :character Mode :character Median :1925
## Mean :1928
## 3rd Qu.:1935
## Max. :2017
##
## ADDRESS APARTMENT.NUMBER ZIP.CODE RESIDENTIAL.UNITS
## Length:1919 Length:1919 11234 :347 Min. :1
## Class :character Class :character 11229 :194 1st Qu.:1
## Mode :character Mode :character 11210 :149 Median :1
## 11203 :128 Mean :1
## 11236 :114 3rd Qu.:1
## 11209 : 88 Max. :1
## (Other):899
## COMMERCIAL.UNITS TOTAL.UNITS LAND.SQUARE.FEET GROSS.SQUARE.FEET
## Min. :0 Min. :1 Min. : 345 Min. : 308
## 1st Qu.:0 1st Qu.:1 1st Qu.: 1800 1st Qu.:1178
## Median :0 Median :1 Median : 2000 Median :1408
## Mean :0 Mean :1 Mean : 2357 Mean :1559
## 3rd Qu.:0 3rd Qu.:1 3rd Qu.: 2596 3rd Qu.:1800
## Max. :0 Max. :1 Max. :10744 Max. :3963
##
## SALE.PRICE SALE.DATE
## Min. : 3500 Min. :2018-04-02
## 1st Qu.: 515000 1st Qu.:2018-06-28
## Median : 690000 Median :2018-09-20
## Mean : 881913 Mean :2018-09-23
## 3rd Qu.: 995000 3rd Qu.:2018-12-17
## Max. :8325000 Max. :2019-03-29
##
single.fam.house$group <- "Single Family"
summary(multi.fam.no.comm)
## NEIGHBORHOOD BUILDING.CLASS.CATEGORY YEAR.BUILT
## Length:9328 Length:9328 Min. : 0
## Class :character Class :character 1st Qu.:1910
## Mode :character Mode :character Median :1925
## Mean :1812
## 3rd Qu.:1997
## Max. :2018
##
## ADDRESS APARTMENT.NUMBER ZIP.CODE RESIDENTIAL.UNITS
## Length:9328 Length:9328 11234 : 604 Min. :0.000
## Class :character Class :character 11236 : 475 1st Qu.:1.000
## Mode :character Mode :character 11229 : 466 Median :1.000
## 11215 : 408 Mean :1.541
## 11208 : 400 3rd Qu.:2.000
## 11207 : 395 Max. :3.000
## (Other):6580
## COMMERCIAL.UNITS TOTAL.UNITS LAND.SQUARE.FEET GROSS.SQUARE.FEET
## Min. :0 Min. : 0.000 Min. : 0 Min. : 200
## 1st Qu.:0 1st Qu.: 1.000 1st Qu.: 1500 1st Qu.:1090
## Median :0 Median : 1.000 Median : 2000 Median :1573
## Mean :0 Mean : 1.555 Mean : 3948 Mean :1719
## 3rd Qu.:0 3rd Qu.: 2.000 3rd Qu.: 2620 3rd Qu.:2256
## Max. :0 Max. :21.000 Max. :146587 Max. :4000
## NA's :5
## SALE.PRICE SALE.DATE
## Min. : 3500 Min. :2018-04-02
## 1st Qu.: 585587 1st Qu.:2018-06-26
## Median : 835000 Median :2018-09-17
## Mean :1019442 Mean :2018-09-21
## 3rd Qu.:1250000 3rd Qu.:2018-12-14
## Max. :8325000 Max. :2019-03-29
##
multi.fam.no.comm$group <- "Multi Family No Comm."
summary(multi.fam.with.comm)
## NEIGHBORHOOD BUILDING.CLASS.CATEGORY YEAR.BUILT
## Length:675 Length:675 Min. : 0
## Class :character Class :character 1st Qu.:1922
## Mode :character Mode :character Median :1931
## Mean :1874
## 3rd Qu.:1960
## Max. :2017
##
## ADDRESS APARTMENT.NUMBER ZIP.CODE RESIDENTIAL.UNITS
## Length:675 Length:675 11204 : 50 Min. :0.0000
## Class :character Class :character 11249 : 36 1st Qu.:0.0000
## Mode :character Mode :character 11234 : 30 Median :1.0000
## 11208 : 29 Mean :0.9452
## 11203 : 27 3rd Qu.:2.0000
## 11207 : 27 Max. :3.0000
## (Other):476
## COMMERCIAL.UNITS TOTAL.UNITS LAND.SQUARE.FEET GROSS.SQUARE.FEET
## Min. : 1.000 Min. : 1.000 Min. : 0 Min. : 112
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1500 1st Qu.:1013
## Median : 1.000 Median : 2.000 Median : 2000 Median :2336
## Mean : 2.233 Mean : 4.948 Mean : 3658 Mean :2110
## 3rd Qu.: 1.000 3rd Qu.: 3.000 3rd Qu.: 2931 3rd Qu.:3130
## Max. :22.000 Max. :57.000 Max. :137889 Max. :4000
##
## SALE.PRICE SALE.DATE
## Min. : 4900 Min. :2018-04-02
## 1st Qu.: 548836 1st Qu.:2018-06-28
## Median : 975000 Median :2018-10-03
## Mean : 1339360 Mean :2018-10-03
## 3rd Qu.: 1622500 3rd Qu.:2019-01-02
## Max. :32250000 Max. :2019-03-29
##
multi.fam.with.comm$group <- "Multi Family With Comm."
total <- rbind(condos, single.fam.house, multi.fam.no.comm, multi.fam.with.comm)
condo.and.single.fam <- rbind(condos, single.fam.house)
Box Plot Comparison:
In the box plot below, we can see that the medians are close together, but each of the categories has several outliers, with commercial properties having the most extreme values.
ggplot(total, aes(x = GROSS.SQUARE.FEET, y = SALE.PRICE, color = group)) + geom_boxplot(notch = TRUE) + xlab("Gross Square Feet") + ylab("Sales Price") + labs(color = "Property Type")
plot(reduced_result$LAND.SQUARE.FEET, reduced_result$SALE.PRICE, xlab = "Square Feet ", ylab = "Sales Price")
# hue and luminosity are left at their defaults, as in the original call
my_cols <- randomColor(count = 39)
pairs(reduced_result[,7:10], pch= 19, cex = 0.5, col=my_cols[reduced_result$ZIP.CODE], lower.panel = NULL)
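For the box plot comparison, the quantities a box plot draws can also be inspected numerically. The sketch below uses base R on simulated prices for two hypothetical groups (not the Brooklyn data) and shows how `outline = FALSE` hides outliers in the drawing without removing them from the data.

```r
# Simulated prices for two hypothetical groups (not the Brooklyn data).
set.seed(1)
toy <- data.frame(
  group = rep(c("Condo", "Single Family"), each = 200),
  price = c(rlnorm(200, log(800000), 0.6), rlnorm(200, log(700000), 0.6))
)

# plot = FALSE returns the five-number summaries without drawing anything.
bp <- boxplot(price ~ group, data = toy, plot = FALSE)
bp$stats[3, ]   # medians, one per group
length(bp$out)  # number of points beyond the whiskers

# outline = FALSE hides those points in the drawing; the data are unchanged.
boxplot(price ~ group, data = toy, outline = FALSE,
        xlab = "Property Type", ylab = "Sales Price")
```

The centre line of each box is the group median rather than the mean, which is worth keeping in mind when comparing the boxes against the ANOVA results on means.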
Does Gross Square Feet have a direct linear relationship to price in Brooklyn, NY?
Are the average property sales prices significantly different for condos, single family homes, multi-family homes with and without commerical units?
Below we check the following conditions on the 4 subsets of data:
Based on the graphs below, the data is skewed to the right, but we will proceed with the analysis understanding that these results are skewed.
par(mfrow=c(2,2))
qqnorm(condos$SALE.PRICE, main = "Condo Price QQNorm")
qqline(condos$SALE.PRICE)
qqnorm(condos$GROSS.SQUARE.FEET, main = "Condo SQ Ft QQNorm")
qqline(condos$GROSS.SQUARE.FEET)
hist(condos$SALE.PRICE, xlab = "Sales Price", ylab="Frequency", main = "Frequency Plot")
hist(condos$GROSS.SQUARE.FEET,xlab = "Gross Square Feet", ylab="Frequency",main = "Frequency Plot")
par(mfrow=c(2,2))
qqnorm(single.fam.house$SALE.PRICE, main = "Single Fam. Price QQNorm")
qqline(single.fam.house$SALE.PRICE)
qqnorm(single.fam.house$GROSS.SQUARE.FEET, main = "Single Fam. Sq Ft. QQNorm")
qqline(single.fam.house$GROSS.SQUARE.FEET)
hist(single.fam.house$SALE.PRICE, xlab = "Sales Price", ylab="Frequency",main = "Frequency Plot")
hist(single.fam.house$GROSS.SQUARE.FEET,xlab = "Gross Square Feet", ylab="Frequency",main = "Frequency Plot")
par(mfrow=c(2,2))
qqnorm(multi.fam.no.comm$SALE.PRICE, main="Multi Fam. Price QQNorm")
qqline(multi.fam.no.comm$SALE.PRICE)
qqnorm(multi.fam.no.comm$GROSS.SQUARE.FEET, main="Multi Fam. Sq Ft. QQNorm")
qqline(multi.fam.no.comm$GROSS.SQUARE.FEET)
hist(multi.fam.no.comm$SALE.PRICE, xlab = "Sales Price", ylab="Frequency",main = "Frequency Plot")
hist(multi.fam.no.comm$GROSS.SQUARE.FEET, xlab = "Gross Square Feet", ylab="Frequency",main = "Frequency Plot")
par(mfrow=c(2,2))
qqnorm(multi.fam.with.comm$SALE.PRICE, main="Multi Fam. w/c Price QQNorm")
qqline(multi.fam.with.comm$SALE.PRICE)
qqnorm(multi.fam.with.comm$GROSS.SQUARE.FEET, main="Multi Fam. w/c Sq Ft. QQNorm")
qqline(multi.fam.with.comm$GROSS.SQUARE.FEET)
hist(multi.fam.with.comm$SALE.PRICE, xlab = "Sales Price", ylab="Frequency",main = "Frequency Plot")
hist(multi.fam.with.comm$GROSS.SQUARE.FEET,xlab = "Gross Square Feet", ylab="Frequency",main = "Frequency Plot")
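One common way to handle right-skewed prices like those in the plots above is to look at the logarithm of price instead. This report does not do that; the sketch below only illustrates the effect on simulated data, and the `rlnorm` parameters and the `skewness` helper are assumptions made for the example.

```r
# Simulated right-skewed prices; parameters are illustrative assumptions.
set.seed(7)
sim_price <- rlnorm(2000, meanlog = log(800000), sdlog = 0.7)

# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skewness(sim_price)       # clearly positive: long right tail
skewness(log(sim_price))  # close to zero after the log transform

par(mfrow = c(1, 2))
qqnorm(sim_price, main = "Raw price")
qqline(sim_price)
qqnorm(log(sim_price), main = "Log price")
qqline(log(sim_price))
```

On the log scale, results describe relative (percentage) differences in price rather than dollar differences, so the interpretation of any later test or regression changes accordingly.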
In this section, I perform the following tests on 4 subsets of data
In this section, I am interested in determining if there is a significant difference of means in sales price between condos and single family homes:
model1 <- aov(condo.and.single.fam$SALE.PRICE ~ condo.and.single.fam$group)
summary(model1)
## Df Sum Sq Mean Sq F value Pr(>F)
## condo.and.single.fam$group 1 2.164e+13 2.164e+13 43.01 5.95e-11 ***
## Residuals 5323 2.677e+15 5.030e+11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
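Because the comparison above involves exactly two groups, the one-way ANOVA is equivalent to a pooled-variance two-sample t-test: the F statistic is the square of the t statistic. A minimal sketch on simulated data (group sizes, means and standard deviations are assumptions, not the Brooklyn values):

```r
# Simulated prices for two hypothetical groups (not the Brooklyn data).
set.seed(3)
toy <- data.frame(
  group = rep(c("Condo", "Single Family"), times = c(60, 40)),
  price = c(rnorm(60, 1000000, 300000), rnorm(40, 900000, 300000))
)

# One-way ANOVA on a two-level factor.
f_value <- summary(aov(price ~ group, data = toy))[[1]][["F value"]][1]

# Pooled-variance two-sample t-test on the same data.
t_value <- unname(t.test(price ~ group, data = toy, var.equal = TRUE)$statistic)

# With exactly two groups, F equals t squared.
all.equal(f_value, t_value^2)
```

Both tests assume equal variances in the two groups. Dropping `var.equal = TRUE` gives Welch's t-test, which relaxes that assumption.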
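Based on the F value above (43.01, p = 5.95e-11), we can see that there is a statistically significant difference in mean sale price between the two groups. However, we expect this to some extent. I want to run one more ANOVA test for two neighborhoods within the same property type: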
Condos
wt <- condos %>% filter(NEIGHBORHOOD == "WINDSOR TERRACE")
ken <- condos %>% filter(NEIGHBORHOOD == "KENSINGTON")
both <- rbind(wt, ken)
model2 <- aov(both$SALE.PRICE ~ both$NEIGHBORHOOD)
summary(model2)
## Df Sum Sq Mean Sq F value Pr(>F)
## both$NEIGHBORHOOD 1 3.459e+09 3.459e+09 0.052 0.821
## Residuals 33 2.197e+12 6.659e+10
Based on the p-value (0.821) and the F value (0.052), there is no statistically significant difference between the average sale prices of condos in Kensington and Windsor Terrace.
Single Family Homes
wt <- single.fam.house %>% filter(NEIGHBORHOOD == "WINDSOR TERRACE")
ken <- single.fam.house %>% filter(NEIGHBORHOOD == "KENSINGTON")
both <- rbind(wt, ken)
model2 <- aov(both$SALE.PRICE ~ both$NEIGHBORHOOD)
summary(model2)
## Df Sum Sq Mean Sq F value Pr(>F)
## both$NEIGHBORHOOD 1 2.19e+09 2.190e+09 0.021 0.888
## Residuals 16 1.70e+12 1.062e+11
Based on the p-value (0.888) and the F value (0.021), there is no statistically significant difference between the average sale prices of single family homes in Kensington and Windsor Terrace.
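The p-values in the two neighborhood tables follow directly from the F statistic and its degrees of freedom, which can be checked with `pf()`. The inputs below are copied from the output above. The residual degrees of freedom (33 and 16) also show how small these samples are: 35 condo sales and 18 single family sales across the two neighborhoods.

```r
# F statistics and degrees of freedom copied from the two ANOVA tables above.
p_condos <- pf(0.052, df1 = 1, df2 = 33, lower.tail = FALSE)
p_single <- pf(0.021, df1 = 1, df2 = 16, lower.tail = FALSE)

# Should land close to the reported 0.821 and 0.888, up to rounding of F.
round(c(condos = p_condos, single_family = p_single), 2)
```

With samples this small, a non-significant result says little about whether the neighborhoods truly have equal mean prices. It only means these data cannot distinguish them.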
I tested this on the following sets of data to determine if linear regression can be used:
For the Condo dataset, we can see that we have a correlation value of 0.63, which shows that there is a moderate positive relationship between square footage and price.
cor(condos$SALE.PRICE, condos$GROSS.SQUARE.FEET)
## [1] 0.6261303
condo.lm <- lm(SALE.PRICE ~ GROSS.SQUARE.FEET, data=condos)
summary(condo.lm)
##
## Call:
## lm(formula = SALE.PRICE ~ GROSS.SQUARE.FEET, data = condos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2750509 -309725 -23464 234904 4473445
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17714.19 23312.14 0.76 0.447
## GROSS.SQUARE.FEET 921.97 19.68 46.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 555600 on 3404 degrees of freedom
## Multiple R-squared: 0.392, Adjusted R-squared: 0.3919
## F-statistic: 2195 on 1 and 3404 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(condos$SALE.PRICE ~ condos$GROSS.SQUARE.FEET, xlab="Square Feet", ylab="Sale Price", main="Condos Sales Price vs. Sqft")
abline(condo.lm)
# model diagnostics - linearity and nearly normal residuals
plot(condo.lm$residuals ~ condos$SALE.PRICE, xlab="Sale Price", ylab="Residuals", main="Residuals vs. Condos Sales Price")
abline(h = 0, lty = 3)
hist(condo.lm$residuals, xlab="Residuals", main="Condo Residuals")
qqnorm(condo.lm$residuals)
qqline(condo.lm$residuals)
# ------
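Two numbers in the condo output above can be tied together by hand. In a regression with a single predictor, R-squared is the squared correlation, and the coefficients give a fitted price for any size. The 1,000 sq ft size used below is a hypothetical choice for illustration.

```r
# Values copied from the condo output above.
r         <- 0.6261303   # cor(SALE.PRICE, GROSS.SQUARE.FEET)
intercept <- 17714.19
slope     <- 921.97      # dollars per additional gross square foot

# In a one-predictor regression, R-squared is the squared correlation.
r_squared <- r^2
r_squared

# Fitted price for a hypothetical 1,000 sq ft condo.
fitted_price <- intercept + slope * 1000
fitted_price
```

An R-squared of about 0.39 means square footage alone accounts for roughly 39% of the variation in condo sale prices, so individual predictions from this line carry a large error (the residual standard error above is about $555,600).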
For the Single Family dataset, we can see that we have a correlation value of 0.58, which shows that there is some positive relationship between square footage and price, but lower than that of a condo.
cor(single.fam.house$SALE.PRICE, single.fam.house$GROSS.SQUARE.FEET)
## [1] 0.5816722
single.fam.lm <- lm(SALE.PRICE ~ GROSS.SQUARE.FEET, data=single.fam.house)
summary(single.fam.lm)
##
## Call:
## lm(formula = SALE.PRICE ~ GROSS.SQUARE.FEET, data = single.fam.house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2030732 -270503 -54623 172477 6545986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -209852.02 37236.93 -5.636 2e-08 ***
## GROSS.SQUARE.FEET 700.30 22.37 31.309 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 572300 on 1917 degrees of freedom
## Multiple R-squared: 0.3383, Adjusted R-squared: 0.338
## F-statistic: 980.3 on 1 and 1917 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(single.fam.house$SALE.PRICE ~ single.fam.house$GROSS.SQUARE.FEET,xlab="Square Feet", ylab="Sale Price", main="Single Family Sales Price vs. Sqft")
abline(single.fam.lm)
# model diagnostics - linearity and nearly normal residuals
plot(single.fam.lm$residuals ~ single.fam.house$SALE.PRICE,xlab="Sale Price", ylab="Residuals", main="Residuals vs. Single Family Sales Price")
abline(h = 0, lty = 3)
hist(single.fam.lm$residuals, breaks=30,xlab="Residuals", main="Single Family Residuals")
qqnorm(single.fam.lm$residuals)
qqline(single.fam.lm$residuals)
# ------
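The diagnostic plots in this section put residuals on the y-axis and sale price (the response) on the x-axis. Residuals are always positively correlated with the response, even when the model is correct, so the more conventional check plots residuals against fitted values or against the predictor. A minimal sketch on simulated data with a truly linear relationship (all parameters are assumptions):

```r
# Simulated data with a truly linear relationship (not the Brooklyn data).
set.seed(11)
sqft  <- runif(500, 400, 4000)
price <- 100000 + 700 * sqft + rnorm(500, sd = 300000)
fit   <- lm(price ~ sqft)

# Least squares makes residuals uncorrelated with the fitted values...
cor(resid(fit), fitted(fit))
# ...but they stay positively correlated with the response itself.
cor(resid(fit), price)

par(mfrow = c(1, 2))
plot(fitted(fit), resid(fit), xlab = "Fitted Price", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 3)
plot(price, resid(fit), xlab = "Sale Price", ylab = "Residuals",
     main = "Residuals vs. Response")
abline(h = 0, lty = 3)
```

In the second panel the upward trend appears even though the simulated relationship is exactly linear, so a trend in a residuals-versus-response plot is not by itself evidence of non-linearity.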
For the Multi-Family dataset without commercial units, we can see that we have a correlation value of 0.37, which shows that there is some positive relationship between square footage and price, but not as strong as for condos or single family homes.
cor(multi.fam.no.comm$SALE.PRICE, multi.fam.no.comm$GROSS.SQUARE.FEET)
## [1] 0.3655884
multi.fam.no.comm.lm <- lm(SALE.PRICE ~ GROSS.SQUARE.FEET, data=multi.fam.no.comm)
summary(multi.fam.no.comm.lm)
##
## Call:
## lm(formula = SALE.PRICE ~ GROSS.SQUARE.FEET, data = multi.fam.no.comm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1751297 -379759 -121228 207819 6936088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.527e+05 1.649e+04 27.45 <2e-16 ***
## GROSS.SQUARE.FEET 3.296e+02 8.691e+00 37.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 674600 on 9326 degrees of freedom
## Multiple R-squared: 0.1337, Adjusted R-squared: 0.1336
## F-statistic: 1439 on 1 and 9326 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(multi.fam.no.comm$SALE.PRICE ~ multi.fam.no.comm$GROSS.SQUARE.FEET,xlab="Square Feet", ylab="Sale Price", main="Multi Fam. w/o c. Sales Price vs. Sqft")
abline(multi.fam.no.comm.lm)
# model diagnostics - linearity and nearly normal residuals
plot(multi.fam.no.comm.lm$residuals ~ multi.fam.no.comm$SALE.PRICE,xlab="Sale Price", ylab="Residuals", main="Residuals vs. Multi Fam. w/o c. Sales Price")
abline(h = 0, lty = 3)
hist(multi.fam.no.comm.lm$residuals, breaks=30,xlab="Residuals", main="Multi Fam. w/o c. Residuals")
qqnorm(multi.fam.no.comm.lm$residuals)
qqline(multi.fam.no.comm.lm$residuals)
# ------
For the Multi-Family dataset with commercial units, we can see that we have a correlation value of 0.26, which shows that there is some positive relationship between square footage and price, but weaker than for the other three property types.
cor(multi.fam.with.comm$SALE.PRICE, multi.fam.with.comm$GROSS.SQUARE.FEET)
## [1] 0.2640267
multi.fam.with.comm.lm <- lm(SALE.PRICE ~ GROSS.SQUARE.FEET, data=multi.fam.with.comm)
summary(multi.fam.with.comm.lm)
##
## Call:
## lm(formula = SALE.PRICE ~ GROSS.SQUARE.FEET, data = multi.fam.with.comm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2065055 -615608 -377012 182475 31080472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 465939.43 142541.78 3.269 0.00114 **
## GROSS.SQUARE.FEET 413.88 58.28 7.101 3.15e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1872000 on 673 degrees of freedom
## Multiple R-squared: 0.06971, Adjusted R-squared: 0.06833
## F-statistic: 50.43 on 1 and 673 DF, p-value: 3.148e-12
par(mfrow=c(2,2))
plot(multi.fam.with.comm$SALE.PRICE ~ multi.fam.with.comm$GROSS.SQUARE.FEET,xlab="Square Feet", ylab="Sale Price", main="Multi Fam. w c. Sales Price vs. Sqft")
abline(multi.fam.with.comm.lm)
# model diagnostics - linearity and nearly normal residuals
plot(multi.fam.with.comm.lm$residuals ~ multi.fam.with.comm$SALE.PRICE,xlab="Sale Price", ylab="Residuals", main="Residuals vs. Multi Fam. w. c. Sales Price")
abline(h = 0, lty = 3)
hist(multi.fam.with.comm.lm$residuals, breaks=30,xlab="Residuals", main="Multi Fam. w. c. Residuals")
qqnorm(multi.fam.with.comm.lm$residuals)
qqline(multi.fam.with.comm.lm$residuals)
# ------
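The four separate regressions above give four different slopes. One way to test whether such slopes really differ is a single model with an interaction between square footage and property type. This is a sketch of that alternative on simulated data, not something this report does; the variable names and the slopes of 900 and 700 are assumptions.

```r
# Simulated data for two hypothetical property types (not the Brooklyn data).
set.seed(5)
n   <- 300
toy <- data.frame(
  group = rep(c("Condo", "Single Family"), each = n),
  sqft  = runif(2 * n, 400, 4000)
)
true_slope <- ifelse(toy$group == "Condo", 900, 700)  # assumed for illustration
toy$price  <- 50000 + true_slope * toy$sqft + rnorm(2 * n, sd = 200000)

# sqft * group expands to sqft + group + sqft:group.
fit <- lm(price ~ sqft * group, data = toy)

# The sqft:group row estimates the slope difference relative to the baseline
# type, and its t-test asks whether that difference is distinguishable from 0.
coef(summary(fit))
```

Applied to the real data, the same formula with `SALE.PRICE`, `GROSS.SQUARE.FEET` and the `group` column of `total` would be the analogous model, with the caveat that some sales may fall into more than one of the four subsets.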
We originally set out to answer the following questions:
Based on our ANOVA test comparing condos and single family homes, we determined that the mean sale prices are statistically different (F = 43.01, p = 5.95e-11). We expected this due to the different types of properties. However, I wanted to check the difference of means between two neighborhoods as well. In those ANOVA tests, the p-values were 0.82 for condos and 0.89 for single family homes, which indicates that the difference between the neighborhood means was not statistically significant.
In addition to the test above, we also did comparisons on two neighborhoods for both types of properties:
For both ANOVA tests, the p-value was well above 0.05, which indicates that the difference in average sale price between the two neighborhoods is not statistically significant.
While not statistical in nature, based on the data gathered, we can see that certain types of properties sell for much more in Brooklyn, in the following increasing order:
Multi Families – With Commercial
Is there a clear correlation between square footage and the price of a property? Is it consistent for all three types of properties collected here?
For all of the regression models above, the correlation is positive; however, the relationship is not linear and curves upward, roughly exponentially, toward the higher sales prices. While there is a positive correlation value for each of the property types, we cannot rely on the linear regression model to predict the value of properties at the higher end, as the data are skewed to the right.
Final Thoughts and Lessons Learned:
Building predictive models is extremely complex. Even when data appear to be nearly normal, I found that as soon as you run into non-linear data, trying to fit a model can be very challenging and requires time and practice to master. It is impossible to find a perfect model that will fit the data. I found that I can reshape data to be closer to a model, but doing so alters the information and can invalidate any findings. This illustrates why many people feel that you can easily lie with statistics.