#Importing the data from Kaggle
#library(readr)
mydata <-read.csv ("~/Desktop/IMB/NY-House-Dataset.csv",header= TRUE,sep=",", dec=".")
head(mydata)
## BROKERTITLE TYPE
## 1 Brokered by Douglas Elliman -111 Fifth Ave Condo for sale
## 2 Brokered by Serhant Condo for sale
## 3 Brokered by Sowae Corp House for sale
## 4 Brokered by COMPASS Condo for sale
## 5 Brokered by Sotheby's International Realty - East Side Manhattan Brokerage Townhouse for sale
## 6 Brokered by Sowae Corp House for sale
## PRICE BEDS BATH PROPERTYSQFT
## 1 315000 2 2.000000 1400
## 2 195000000 7 10.000000 17545
## 3 260000 4 2.000000 2015
## 4 69000 3 1.000000 445
## 5 55000000 7 2.373861 14175
## 6 690000 5 2.000000 4004
## ADDRESS STATE
## 1 2 E 55th St Unit 803 New York, NY 10022
## 2 Central Park Tower Penthouse-217 W 57th New York St Unit Penthouse New York, NY 10019
## 3 620 Sinclair Ave Staten Island, NY 10312
## 4 2 E 55th St Unit 908W33 Manhattan, NY 10022
## 5 5 E 64th St New York, NY 10065
## 6 584 Park Pl Brooklyn, NY 11238
## MAIN_ADDRESS
## 1 2 E 55th St Unit 803New York, NY 10022
## 2 Central Park Tower Penthouse-217 W 57th New York St Unit PenthouseNew York, NY 10019
## 3 620 Sinclair AveStaten Island, NY 10312
## 4 2 E 55th St Unit 908W33Manhattan, NY 10022
## 5 5 E 64th StNew York, NY 10065
## 6 584 Park PlBrooklyn, NY 11238
## ADMINISTRATIVE_AREA_LEVEL_2 LOCALITY SUBLOCALITY STREET_NAME LONG_NAME
## 1 New York County New York Manhattan East 55th Street Regis Residence
## 2 United States New York New York County New York West 57th Street
## 3 United States New York Richmond County Staten Island Sinclair Avenue
## 4 United States New York New York County New York East 55th Street
## 5 United States New York New York County New York East 64th Street
## 6 United States New York Kings County Brooklyn Park Place
## FORMATTED_ADDRESS LATITUDE LONGITUDE
## 1 Regis Residence, 2 E 55th St #803, New York, NY 10022, USA 40.76125 -73.97448
## 2 217 W 57th St, New York, NY 10019, USA 40.76639 -73.98099
## 3 620 Sinclair Ave, Staten Island, NY 10312, USA 40.54181 -74.19611
## 4 2 E 55th St, New York, NY 10022, USA 40.76140 -73.97461
## 5 5 E 64th St, New York, NY 10065, USA 40.76722 -73.96986
## 6 584 Park Pl, Brooklyn, NY 11238, USA 40.67436 -73.95872
#Deleting the irrelevant variables for regression analysis.
mydata <- mydata [,c(-1,-7,-8,-9, -10,-11,-12,-13,-14,-15,-16,-17) ]
head (mydata)
## TYPE PRICE BEDS BATH PROPERTYSQFT
## 1 Condo for sale 315000 2 2.000000 1400
## 2 Condo for sale 195000000 7 10.000000 17545
## 3 House for sale 260000 4 2.000000 2015
## 4 Condo for sale 69000 3 1.000000 445
## 5 Townhouse for sale 55000000 7 2.373861 14175
## 6 House for sale 690000 5 2.000000 4004
The source of data is Kaggle, New York Housing Market. (https://www.kaggle.com/datasets/nelgiriyewithana/new-york-housing-market).
set.seed(1)
mydata2 <- mydata[sample(nrow(mydata), 400), ] #Making the sample out of data.
head (mydata2)
## TYPE PRICE BEDS BATH PROPERTYSQFT
## 1017 Multi-family home for sale 989000 5 3 2184.208
## 4775 Multi-family home for sale 890000 4 3 2184.208
## 2177 Co-op for sale 425000 3 2 1300.000
## 1533 Co-op for sale 298999 2 1 2184.208
## 4567 House for sale 949000 4 2 2172.000
## 2347 Multi-family home for sale 16995000 5 4 4230.000
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Assigning ID's to my data
mydata2 <- mydata2 %>% mutate(ID = row_number())
library(dplyr)
# Dropping categorical variables that are not values for property types.
mydata2 <- mydata2 %>% filter(TYPE != 'Coming Soon')
mydata2 <- mydata2 %>% filter(TYPE != 'Condop for sale')
mydata2 <- mydata2 %>% filter(TYPE != 'Contingent')
mydata2 <- mydata2 %>% filter(TYPE != 'For Sale')
mydata2 <- mydata2 %>% filter(TYPE != 'Foreclosure')
mydata2 <- mydata2 %>% filter(TYPE != 'Pending')
#Making factors out of categorical variables.
mydata2$typeF <- factor(mydata2$TYPE,
levels = c("Co-op for sale","Condo for sale","House for sale","Land for sale","Mobile house for sale", "Multi-family home for sale", "Townhouse for sale"),
labels=c("Co-op for sale","Condo for sale","House for sale", "Land for sale", "Mobile house for sale", "Multi-family home for sale","Townhouse for sale" ))
#Cleaning the data, removing any possible N/A's.
#install.packages("tidyr")
library(tidyr)
mydata2 <- drop_na(mydata2)
head(mydata2)
## TYPE PRICE BEDS BATH PROPERTYSQFT ID typeF
## 1 Multi-family home for sale 989000 5 3 2184.208 1 Multi-family home for sale
## 2 Multi-family home for sale 890000 4 3 2184.208 2 Multi-family home for sale
## 3 Co-op for sale 425000 3 2 1300.000 3 Co-op for sale
## 4 Co-op for sale 298999 2 1 2184.208 4 Co-op for sale
## 5 House for sale 949000 4 2 2172.000 5 House for sale
## 6 Multi-family home for sale 16995000 5 4 4230.000 6 Multi-family home for sale
#Descriptive stats.
summary(mydata2)
## TYPE PRICE BEDS BATH PROPERTYSQFT
## Length:368 Min. : 5800 Min. : 1.000 Min. : 0.00 Min. : 250
## Class :character 1st Qu.: 495750 1st Qu.: 2.000 1st Qu.: 1.00 1st Qu.:1299
## Mode :character Median : 869000 Median : 3.000 Median : 2.00 Median :2184
## Mean : 1785418 Mean : 3.587 Mean : 2.42 Mean :2069
## 3rd Qu.: 1499000 3rd Qu.: 4.000 3rd Qu.: 3.00 3rd Qu.:2184
## Max. :25000000 Max. :20.000 Max. :12.00 Max. :9000
##
## ID typeF
## Min. : 1.00 Co-op for sale :104
## 1st Qu.: 99.75 Condo for sale : 75
## Median :200.50 House for sale : 96
## Mean :200.85 Land for sale : 8
## 3rd Qu.:302.25 Mobile house for sale : 0
## Max. :400.00 Multi-family home for sale: 66
## Townhouse for sale : 19
#Checking for linearity and perfect correlation.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(mydata2[ , c(-1, -6,-7)],
smooth = FALSE)
Linearity is not met (price - beds). Have to logarithmically transform price.
mydata2$lnPrice<-log(mydata2$PRICE)
#Rechecking
scatterplotMatrix(mydata2[ , c(-1, -6,-7)],
smooth = FALSE)
In the first scatterplot there was no linearity, therefore we included a log to dependent variable, and now the model is linear.
#Correlation matrix
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata2[ , c(-1,-2, -6,-7)]))
## BEDS BATH PROPERTYSQFT lnPrice
## BEDS 1.00 0.80 0.53 0.41
## BATH 0.80 1.00 0.59 0.59
## PROPERTYSQFT 0.53 0.59 1.00 0.55
## lnPrice 0.41 0.59 0.55 1.00
##
## n= 368
##
##
## P
## BEDS BATH PROPERTYSQFT lnPrice
## BEDS 0 0 0
## BATH 0 0 0
## PROPERTYSQFT 0 0 0
## lnPrice 0 0 0
We can see that we can reject H0 for all numerical explanatory variables at p-value<0.001. There is correlation between price and number of beds, baths and property sq footage.
fit1 <- lm(lnPrice~ BEDS+BATH+PROPERTYSQFT+typeF,
data = mydata2)
summary(fit1)
##
## Call:
## lm(formula = lnPrice ~ BEDS + BATH + PROPERTYSQFT + typeF, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3987 -0.3407 0.0035 0.3513 2.9996
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.228e+01 9.686e-02 126.797 < 2e-16 ***
## BEDS -6.446e-02 2.874e-02 -2.243 0.0255 *
## BATH 2.925e-01 4.322e-02 6.768 5.33e-11 ***
## PROPERTYSQFT 3.152e-04 4.455e-05 7.075 7.87e-12 ***
## typeFCondo for sale 8.213e-01 1.151e-01 7.134 5.43e-12 ***
## typeFHouse for sale 3.157e-01 1.123e-01 2.812 0.0052 **
## typeFLand for sale -4.062e-01 2.720e-01 -1.493 0.1362
## typeFMulti-family home for sale 2.954e-01 1.455e-01 2.030 0.0431 *
## typeFTownhouse for sale 8.184e-01 2.025e-01 4.041 6.50e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7362 on 359 degrees of freedom
## Multiple R-squared: 0.5171, Adjusted R-squared: 0.5064
## F-statistic: 48.06 on 8 and 359 DF, p-value: < 2.2e-16
#Checking for multicolinearity in fit1.
vif(fit1)
## GVIF Df GVIF^(1/(2*Df))
## BEDS 3.509717 1 1.873424
## BATH 3.317846 1 1.821495
## PROPERTYSQFT 1.653482 1 1.285878
## typeF 1.944051 5 1.068737
There is no too stong multicolinearity, due to all values being below 5, therefore no need to drop any of the variable.
mydata2$StdResid <- round(rstandard(fit1),3) #Standardized residuals
mydata2$CooksD <- round(cooks.distance(fit1),3) # Cooks distances
#Checking for outliers.
hist(mydata2$StdResid,
xlab= "Standardized residuals",
ylab= "Frequency",
main= "Histogram of standardized residuals")
There is a number of values outside +/-3 and have to be removed.
head(mydata2[order(-mydata2$StdResid),],5)
## TYPE PRICE BEDS BATH PROPERTYSQFT ID typeF lnPrice StdResid
## 364 Townhouse for sale 14995000 3 0.000000 2184.208 396 Townhouse for sale 16.52323 4.152
## 76 Co-op for sale 12500000 5 2.373861 2184.208 82 Co-op for sale 16.34124 4.109
## 322 Land for sale 3950000 3 2.373861 2184.208 350 Land for sale 15.18923 3.086
## 346 Townhouse for sale 22500000 4 2.373861 4000.000 376 Townhouse for sale 16.92903 2.989
## 60 Co-op for sale 15500000 4 6.000000 2184.208 65 Co-op for sale 16.55635 2.925
## CooksD
## 364 0.171
## 76 0.032
## 322 0.151
## 346 0.064
## 60 0.059
#Removing outliers.
library(dplyr)
mydata2 <- mydata2 %>%
filter(!ID==396,
!ID==82,
!ID==350)
head(mydata2[order(mydata2$StdResid),],5)
## TYPE PRICE BEDS BATH PROPERTYSQFT ID typeF
## 88 Land for sale 5800 3 2.373861 2184.208 96 Land for sale
## 279 Townhouse for sale 2495000 8 8.000000 5374.000 306 Townhouse for sale
## 10 Condo for sale 200000 1 2.373861 637.000 12 Condo for sale
## 154 Multi-family home for sale 1215000 13 7.000000 5250.000 167 Multi-family home for sale
## 157 Condo for sale 290000 1 1.000000 2184.208 171 Condo for sale
## lnPrice StdResid CooksD
## 88 8.665613 -6.388 0.648
## 279 14.729799 -2.676 0.071
## 10 12.206073 -2.372 0.014
## 154 14.010255 -1.997 0.025
## 157 12.577636 -1.976 0.008
#Removing outliers.
library(dplyr)
mydata2 <- mydata2 %>%
filter(!ID==96)
#Rechecking for outliers.
hist(mydata2$StdResid,
xlab= "Standardized residuals",
ylab= "Frequency",
main= "Histogram of standardized residuals")
We removed all outliers, no values out of +/-3 anymore.
#checking for High influential units.
hist(mydata2$CooksD,
xlab= "Cooks Distance",
ylab= "Frequency",
main= "Histogram of Cooks distances")
head(mydata2[order(-mydata2$CooksD),],10)
## TYPE PRICE BEDS BATH PROPERTYSQFT ID typeF
## 278 Townhouse for sale 2495000 8 8.000000 5374.000 306 Townhouse for sale
## 109 Land for sale 2000000 3 2.373861 2184.208 121 Land for sale
## 343 Townhouse for sale 22500000 4 2.373861 4000.000 376 Townhouse for sale
## 60 Co-op for sale 15500000 4 6.000000 2184.208 65 Co-op for sale
## 93 Multi-family home for sale 7999000 20 10.000000 2184.208 102 Multi-family home for sale
## 290 House for sale 5500000 8 6.000000 9000.000 318 House for sale
## 149 Multi-family home for sale 2499000 18 12.000000 2520.000 163 Multi-family home for sale
## 280 Land for sale 1499000 3 2.373861 2184.208 308 Land for sale
## 123 Land for sale 149000 3 2.373861 2184.208 135 Land for sale
## 347 Townhouse for sale 3995000 8 8.000000 5046.000 381 Townhouse for sale
## lnPrice StdResid CooksD
## 278 14.72980 -2.676 0.071
## 109 14.50866 2.097 0.070
## 343 16.92903 2.989 0.064
## 60 16.55635 2.925 0.059
## 93 15.89483 1.495 0.056
## 290 15.52026 -1.693 0.054
## 149 14.73140 -1.475 0.049
## 280 14.22031 1.679 0.045
## 123 11.91170 -1.674 0.044
## 347 15.20055 -1.862 0.034
All values are below 1, however we have one bigger gap, therefore the values above gap have to be removed.
#Removing high influential units.
library(dplyr)
mydata2 <- mydata2 %>%
filter(!ID==306,
!ID==121,
!ID==376,
!ID==65,
!ID==102,
!ID==318,
!ID==163,
!ID==308,
!ID==135)
#Rechecking for linearity and perfect correlation, after removing outliers and high influential units.
library(car)
scatterplotMatrix(mydata2[ , c(-1,-2, -6,-7,-9,-10)],
smooth = FALSE)
#Scatterplot of standardized residuals and standardized fitted values - checking for homoskedasticity.
#Making another fit, due to exclusion of multiple outliers and high influential units.
fit2 <- lm(lnPrice~ BEDS+BATH+PROPERTYSQFT+typeF,
data = mydata2)
mydata2$StdFitted <- scale(fit2$fitted.values)
#Checking for heteroskedasticity
library(car)
scatterplot(y=mydata2$StdResid, x=mydata2$StdFitted,
ylab= "Standardized residuals",
xlab= "Standardized fitted values",
boxplots= FALSE,
regLine = FALSE,
smooth= FALSE)
#Breusch-Pagan test for heteroskedasticity.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -----------------------------------
## Response : lnPrice
## Variables: fitted values of lnPrice
##
## Test Summary
## ------------------------------
## DF = 1
## Chi2 = 8.400273
## Prob > Chi2 = 0.003751646
We reject H0 at p-value= 0.004. We have a heteroskedastic model. Have to use robust errors.
library(estimatr)
#Trasformig fit to include robust errors.
fit_robust1<- lm_robust(lnPrice~ BEDS+BATH+PROPERTYSQFT+typeF,
se_type="HC1",
data = mydata2)
#Results of regression model.
summary(fit_robust1)
##
## Call:
## lm_robust(formula = lnPrice ~ BEDS + BATH + PROPERTYSQFT + typeF,
## data = mydata2, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 12.1793736 9.066e-02 134.3447 1.274e-300 12.0010642 12.3576830 346
## BEDS -0.0872997 2.794e-02 -3.1246 1.931e-03 -0.1422531 -0.0323463 346
## BATH 0.3340432 4.259e-02 7.8431 5.514e-14 0.2502737 0.4178128 346
## PROPERTYSQFT 0.0003436 4.367e-05 7.8688 4.632e-14 0.0002577 0.0004295 346
## typeFCondo for sale 0.8416410 1.142e-01 7.3671 1.290e-12 0.6169415 1.0663405 346
## typeFHouse for sale 0.3533896 1.086e-01 3.2542 1.249e-03 0.1398028 0.5669764 346
## typeFLand for sale -0.1212493 2.028e-01 -0.5978 5.503e-01 -0.5201525 0.2776540 346
## typeFMulti-family home for sale 0.3099613 1.293e-01 2.3977 1.703e-02 0.0556990 0.5642237 346
## typeFTownhouse for sale 0.5755176 1.786e-01 3.2230 1.389e-03 0.2243056 0.9267297 346
##
## Multiple R-squared: 0.6021 , Adjusted R-squared: 0.5929
## F-statistic: 49.22 on 8 and 346 DF, p-value: < 2.2e-16
#Coefficient of multiple correlation
sqrt(summary(fit_robust1)$r.squared)
## [1] 0.7759453
From the summary we can see:
All variables are statistically significant.
H0: b1=0
H1: b1=/0
If the apartment has an additional bed, the price will on average decrease by 8.7 % (at p-value<0.001), assuming everything else remains the same.
H0: b2=0
H1: b2=/0
If the apartment has an additional bath, the price will on average increase by 33.4 %, (at p-value<0.001), ssuming everything else remains the same. (39.65 if we do e^b3, but at small values of b’s that is not needed)
H0: b3=0
H1: b3=/0
If the size of an apartment increases by 1 square meter, the price will on average increase by 0.034 %,(at p-value<0.001)ssuming everything else remains the same.
H0: b4=0
H1: b4=/0
Given all other variables, a Condo has on average a price higher by 130.5 % compared to a Co-op, (p-value<0.001).
H0: b5=0
H1: b5=/0 Given all other variables, a House has on average a price higher by 41.1 % compared to a Co-op, (p-value<0.001).
H0: b6=0
H1: b6=/0
Given all other variables, a Land has on average a price lower by 13 % compared to a Co-op, (p-value<0.001).
H0: b7=0
H1: b7=/0
Given all other variables, a Multi family home has on average a price higher by 35.2 % compared to a Co-op, (p-value<0.001).
H0: b8=0
H1: b8=/0
Given all other variables, a Townhouse has on average a price higher by 75.8 % compared to a Co-op, (p-value<0.001).
Multiple Multiple R^2: 60.21 % of variability of price of an apartment can be explained with variability of number of beds and baths, property square footage and the type of apartment.
F-Satistics
HO: Rho^2=0
H1: Rho^2=/0 We reject H0 at p-value < 0.001.
Based on sample data we can conclude that there is a strong linear correlation between the dependent variable - Price and explanatory variables in our model. Compared to Co-op (reference group), all other types of apatments are on average more expensive, only Land is cheaper. All explanatory variables are significant, therefore have an effect on the price changes of apartments.