Zoja Ciglenecki

#Importing the data from Kaggle
#library(readr) 
mydata <-read.csv ("~/Desktop/IMB/NY-House-Dataset.csv",header= TRUE,sep=",", dec=".")
head(mydata)
##                                                                  BROKERTITLE               TYPE
## 1                                Brokered by Douglas Elliman  -111 Fifth Ave     Condo for sale
## 2                                                        Brokered by Serhant     Condo for sale
## 3                                                     Brokered by Sowae Corp     House for sale
## 4                                                        Brokered by COMPASS     Condo for sale
## 5 Brokered by Sotheby's International Realty - East Side Manhattan Brokerage Townhouse for sale
## 6                                                     Brokered by Sowae Corp     House for sale
##       PRICE BEDS      BATH PROPERTYSQFT
## 1    315000    2  2.000000         1400
## 2 195000000    7 10.000000        17545
## 3    260000    4  2.000000         2015
## 4     69000    3  1.000000          445
## 5  55000000    7  2.373861        14175
## 6    690000    5  2.000000         4004
##                                                              ADDRESS                   STATE
## 1                                               2 E 55th St Unit 803      New York, NY 10022
## 2 Central Park Tower Penthouse-217 W 57th New York St Unit Penthouse      New York, NY 10019
## 3                                                   620 Sinclair Ave Staten Island, NY 10312
## 4                                            2 E 55th St Unit 908W33     Manhattan, NY 10022
## 5                                                        5 E 64th St      New York, NY 10065
## 6                                                        584 Park Pl      Brooklyn, NY 11238
##                                                                           MAIN_ADDRESS
## 1                                               2 E 55th St Unit 803New York, NY 10022
## 2 Central Park Tower Penthouse-217 W 57th New York St Unit PenthouseNew York, NY 10019
## 3                                              620 Sinclair AveStaten Island, NY 10312
## 4                                           2 E 55th St Unit 908W33Manhattan, NY 10022
## 5                                                        5 E 64th StNew York, NY 10065
## 6                                                        584 Park PlBrooklyn, NY 11238
##   ADMINISTRATIVE_AREA_LEVEL_2 LOCALITY     SUBLOCALITY      STREET_NAME        LONG_NAME
## 1             New York County New York       Manhattan East 55th Street  Regis Residence
## 2               United States New York New York County         New York West 57th Street
## 3               United States New York Richmond County    Staten Island  Sinclair Avenue
## 4               United States New York New York County         New York East 55th Street
## 5               United States New York New York County         New York East 64th Street
## 6               United States New York    Kings County         Brooklyn       Park Place
##                                            FORMATTED_ADDRESS LATITUDE LONGITUDE
## 1 Regis Residence, 2 E 55th St #803, New York, NY 10022, USA 40.76125 -73.97448
## 2                     217 W 57th St, New York, NY 10019, USA 40.76639 -73.98099
## 3             620 Sinclair Ave, Staten Island, NY 10312, USA 40.54181 -74.19611
## 4                       2 E 55th St, New York, NY 10022, USA 40.76140 -73.97461
## 5                       5 E 64th St, New York, NY 10065, USA 40.76722 -73.96986
## 6                       584 Park Pl, Brooklyn, NY 11238, USA 40.67436 -73.95872
#Deleting the irrelevant variables for regression analysis.
mydata <- mydata [,c(-1,-7,-8,-9, -10,-11,-12,-13,-14,-15,-16,-17) ] 
head (mydata)
##                 TYPE     PRICE BEDS      BATH PROPERTYSQFT
## 1     Condo for sale    315000    2  2.000000         1400
## 2     Condo for sale 195000000    7 10.000000        17545
## 3     House for sale    260000    4  2.000000         2015
## 4     Condo for sale     69000    3  1.000000          445
## 5 Townhouse for sale  55000000    7  2.373861        14175
## 6     House for sale    690000    5  2.000000         4004

Explanation of dataset:

  • Unit of observation: An apartment in New York.
  • TYPE: Type of the house - categorical variable with 13 categories. (Co-op for sale,Coming Soon, Condo for sale, Condop for sale, Contingent,For Sale, Foreclosure, House for sale, Land for sale, Mobile house for sale, Multi-family home for sale, Pending, Townhouse for sale)
  • PRICE: Price of the house (in $)
  • BEDS: Number of bedrooms
  • BATH: Number of bathrooms
  • PROPERTYSQFT: Square footage of the property (in sqm)

The source of data is Kaggle, New York Housing Market. (https://www.kaggle.com/datasets/nelgiriyewithana/new-york-housing-market).

Cleaning the data

set.seed(1) 
mydata2 <- mydata[sample(nrow(mydata), 400), ] #Making the sample out of data.

head (mydata2)
##                            TYPE    PRICE BEDS BATH PROPERTYSQFT
## 1017 Multi-family home for sale   989000    5    3     2184.208
## 4775 Multi-family home for sale   890000    4    3     2184.208
## 2177             Co-op for sale   425000    3    2     1300.000
## 1533             Co-op for sale   298999    2    1     2184.208
## 4567             House for sale   949000    4    2     2172.000
## 2347 Multi-family home for sale 16995000    5    4     4230.000
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Assigning ID's to my data
mydata2 <- mydata2 %>% mutate(ID = row_number())
library(dplyr)

# Dropping categorical variables that are not values for property types.
mydata2 <- mydata2 %>% filter(TYPE != 'Coming Soon')
mydata2 <- mydata2 %>% filter(TYPE != 'Condop for sale')
mydata2 <- mydata2 %>% filter(TYPE != 'Contingent')
mydata2 <- mydata2 %>% filter(TYPE != 'For Sale')
mydata2 <- mydata2 %>% filter(TYPE != 'Foreclosure')
mydata2 <- mydata2 %>% filter(TYPE != 'Pending')
#Making factors out of categorical variables.
mydata2$typeF <- factor(mydata2$TYPE, 
                                  levels = c("Co-op for sale","Condo for sale","House for sale","Land for sale","Mobile house for sale", "Multi-family home for sale", "Townhouse for sale"),
                                  labels=c("Co-op for sale","Condo for sale","House for sale", "Land for sale", "Mobile house for sale", "Multi-family home for sale","Townhouse for sale" )) 
#Cleaning the data, removing any possible N/A's.
#install.packages("tidyr")
library(tidyr)
mydata2 <- drop_na(mydata2)

head(mydata2)
##                         TYPE    PRICE BEDS BATH PROPERTYSQFT ID                      typeF
## 1 Multi-family home for sale   989000    5    3     2184.208  1 Multi-family home for sale
## 2 Multi-family home for sale   890000    4    3     2184.208  2 Multi-family home for sale
## 3             Co-op for sale   425000    3    2     1300.000  3             Co-op for sale
## 4             Co-op for sale   298999    2    1     2184.208  4             Co-op for sale
## 5             House for sale   949000    4    2     2172.000  5             House for sale
## 6 Multi-family home for sale 16995000    5    4     4230.000  6 Multi-family home for sale

Research question: Do the type of the apartment, number of baths and beds and property square footage have an effect on the price of an apartment?

#Descriptive stats.
summary(mydata2)
##      TYPE               PRICE               BEDS             BATH        PROPERTYSQFT 
##  Length:368         Min.   :    5800   Min.   : 1.000   Min.   : 0.00   Min.   : 250  
##  Class :character   1st Qu.:  495750   1st Qu.: 2.000   1st Qu.: 1.00   1st Qu.:1299  
##  Mode  :character   Median :  869000   Median : 3.000   Median : 2.00   Median :2184  
##                     Mean   : 1785418   Mean   : 3.587   Mean   : 2.42   Mean   :2069  
##                     3rd Qu.: 1499000   3rd Qu.: 4.000   3rd Qu.: 3.00   3rd Qu.:2184  
##                     Max.   :25000000   Max.   :20.000   Max.   :12.00   Max.   :9000  
##                                                                                       
##        ID                                typeF    
##  Min.   :  1.00   Co-op for sale            :104  
##  1st Qu.: 99.75   Condo for sale            : 75  
##  Median :200.50   House for sale            : 96  
##  Mean   :200.85   Land for sale             :  8  
##  3rd Qu.:302.25   Mobile house for sale     :  0  
##  Max.   :400.00   Multi-family home for sale: 66  
##                   Townhouse for sale        : 19

Interpretation

  • Sample size: 368 apartments
  • The price expands from 5800$ to 25 million$.
  • The propert square footage expands from 250 sqm to 9000 sqm.
  • The average number of beds is 3.59 per apartment.
  • The average number of baths is 2.42 per apartment.
  • In our database we will be comparing: 104 Co-ops, 75 Condons, 96 Houses, 8 Lands, 66 Multi-family homes and 19 Townhouses and no Mobile houses.

Checking assumptions

  • Linearity in parameters. Price of an apartment has to be a linear function of property square footage, number of beds and baths and type of the apartment. I transformed Price (dependent variable) with a logarithm.
  • Expected value of errors equals 0. We included all theoretically relevant variables into the model. (variables such as location, might be missing)
  • Homoskedasticity.- Not met. Need to use robust errors.
  • Normal distribution of errors. Needed for small samples only, therefore can be assumed for our dataset.
  • Independent errors - we observe each apartment only once.
  • No perfect multicolinarity. - Checked with scatterplot, met.
  • Number of units observed are greater than the number of variables in the dataset.

Additional requirements

  • Dependent variable must be numerical- met, Price is numerical.
  • Explanatory variables must vary - met, different values of the variables.
  • Checked and removed outliers and high influential units.
  • No too strong multicolinearity - check VIF. - Met, no VIF above 5 or sqroot of 5 for categorical variable.
#Checking for linearity and perfect correlation.
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplotMatrix(mydata2[ , c(-1, -6,-7)], 
                  smooth = FALSE) 

Linearity is not met (price - beds). Have to logarithmically transform price.

mydata2$lnPrice<-log(mydata2$PRICE)
#Rechecking
scatterplotMatrix(mydata2[ , c(-1, -6,-7)], 
                  smooth = FALSE) 

Interpretation:

In the first scatterplot there was no linearity, therefore we included a log to dependent variable, and now the model is linear.

#Correlation matrix
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydata2[ , c(-1,-2, -6,-7)]))
##              BEDS BATH PROPERTYSQFT lnPrice
## BEDS         1.00 0.80         0.53    0.41
## BATH         0.80 1.00         0.59    0.59
## PROPERTYSQFT 0.53 0.59         1.00    0.55
## lnPrice      0.41 0.59         0.55    1.00
## 
## n= 368 
## 
## 
## P
##              BEDS BATH PROPERTYSQFT lnPrice
## BEDS               0    0            0     
## BATH          0         0            0     
## PROPERTYSQFT  0    0                 0     
## lnPrice       0    0    0
Interpretation
  • H0: Rho (price, numerical explanatory variables)=0
  • H1: Rho (price, numerical explanatory variables)=/0

We can see that we can reject H0 for all numerical explanatory variables at p-value<0.001. There is correlation between price and number of beds, baths and property sq footage.

fit1 <- lm(lnPrice~ BEDS+BATH+PROPERTYSQFT+typeF,
           data = mydata2)

summary(fit1)
## 
## Call:
## lm(formula = lnPrice ~ BEDS + BATH + PROPERTYSQFT + typeF, data = mydata2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3987 -0.3407  0.0035  0.3513  2.9996 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      1.228e+01  9.686e-02 126.797  < 2e-16 ***
## BEDS                            -6.446e-02  2.874e-02  -2.243   0.0255 *  
## BATH                             2.925e-01  4.322e-02   6.768 5.33e-11 ***
## PROPERTYSQFT                     3.152e-04  4.455e-05   7.075 7.87e-12 ***
## typeFCondo for sale              8.213e-01  1.151e-01   7.134 5.43e-12 ***
## typeFHouse for sale              3.157e-01  1.123e-01   2.812   0.0052 ** 
## typeFLand for sale              -4.062e-01  2.720e-01  -1.493   0.1362    
## typeFMulti-family home for sale  2.954e-01  1.455e-01   2.030   0.0431 *  
## typeFTownhouse for sale          8.184e-01  2.025e-01   4.041 6.50e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7362 on 359 degrees of freedom
## Multiple R-squared:  0.5171, Adjusted R-squared:  0.5064 
## F-statistic: 48.06 on 8 and 359 DF,  p-value: < 2.2e-16
#Checking for multicolinearity in fit1.

vif(fit1)
##                  GVIF Df GVIF^(1/(2*Df))
## BEDS         3.509717  1        1.873424
## BATH         3.317846  1        1.821495
## PROPERTYSQFT 1.653482  1        1.285878
## typeF        1.944051  5        1.068737
Interpretation

There is no too stong multicolinearity, due to all values being below 5, therefore no need to drop any of the variable.

mydata2$StdResid <- round(rstandard(fit1),3) #Standardized residuals
mydata2$CooksD <- round(cooks.distance(fit1),3) # Cooks distances
#Checking for outliers.
hist(mydata2$StdResid,
     xlab= "Standardized residuals",
     ylab= "Frequency",
     main= "Histogram of standardized residuals")

Interpretation

There is a number of values outside +/-3 and have to be removed.

head(mydata2[order(-mydata2$StdResid),],5)
##                   TYPE    PRICE BEDS     BATH PROPERTYSQFT  ID              typeF  lnPrice StdResid
## 364 Townhouse for sale 14995000    3 0.000000     2184.208 396 Townhouse for sale 16.52323    4.152
## 76      Co-op for sale 12500000    5 2.373861     2184.208  82     Co-op for sale 16.34124    4.109
## 322      Land for sale  3950000    3 2.373861     2184.208 350      Land for sale 15.18923    3.086
## 346 Townhouse for sale 22500000    4 2.373861     4000.000 376 Townhouse for sale 16.92903    2.989
## 60      Co-op for sale 15500000    4 6.000000     2184.208  65     Co-op for sale 16.55635    2.925
##     CooksD
## 364  0.171
## 76   0.032
## 322  0.151
## 346  0.064
## 60   0.059
#Removing outliers.
library(dplyr)
mydata2 <- mydata2 %>%
  filter(!ID==396,
         !ID==82,
         !ID==350)
head(mydata2[order(mydata2$StdResid),],5)
##                           TYPE   PRICE BEDS     BATH PROPERTYSQFT  ID                      typeF
## 88               Land for sale    5800    3 2.373861     2184.208  96              Land for sale
## 279         Townhouse for sale 2495000    8 8.000000     5374.000 306         Townhouse for sale
## 10              Condo for sale  200000    1 2.373861      637.000  12             Condo for sale
## 154 Multi-family home for sale 1215000   13 7.000000     5250.000 167 Multi-family home for sale
## 157             Condo for sale  290000    1 1.000000     2184.208 171             Condo for sale
##       lnPrice StdResid CooksD
## 88   8.665613   -6.388  0.648
## 279 14.729799   -2.676  0.071
## 10  12.206073   -2.372  0.014
## 154 14.010255   -1.997  0.025
## 157 12.577636   -1.976  0.008
#Removing outliers.
library(dplyr)
mydata2 <- mydata2 %>%
  filter(!ID==96)
#Rechecking for outliers.
hist(mydata2$StdResid,
     xlab= "Standardized residuals",
     ylab= "Frequency",
     main= "Histogram of standardized residuals")

Interpretation

We removed all outliers, no values out of +/-3 anymore.

#checking for High influential units.
hist(mydata2$CooksD,
     xlab= "Cooks Distance",
     ylab= "Frequency",
     main= "Histogram of Cooks distances")

head(mydata2[order(-mydata2$CooksD),],10) 
##                           TYPE    PRICE BEDS      BATH PROPERTYSQFT  ID                      typeF
## 278         Townhouse for sale  2495000    8  8.000000     5374.000 306         Townhouse for sale
## 109              Land for sale  2000000    3  2.373861     2184.208 121              Land for sale
## 343         Townhouse for sale 22500000    4  2.373861     4000.000 376         Townhouse for sale
## 60              Co-op for sale 15500000    4  6.000000     2184.208  65             Co-op for sale
## 93  Multi-family home for sale  7999000   20 10.000000     2184.208 102 Multi-family home for sale
## 290             House for sale  5500000    8  6.000000     9000.000 318             House for sale
## 149 Multi-family home for sale  2499000   18 12.000000     2520.000 163 Multi-family home for sale
## 280              Land for sale  1499000    3  2.373861     2184.208 308              Land for sale
## 123              Land for sale   149000    3  2.373861     2184.208 135              Land for sale
## 347         Townhouse for sale  3995000    8  8.000000     5046.000 381         Townhouse for sale
##      lnPrice StdResid CooksD
## 278 14.72980   -2.676  0.071
## 109 14.50866    2.097  0.070
## 343 16.92903    2.989  0.064
## 60  16.55635    2.925  0.059
## 93  15.89483    1.495  0.056
## 290 15.52026   -1.693  0.054
## 149 14.73140   -1.475  0.049
## 280 14.22031    1.679  0.045
## 123 11.91170   -1.674  0.044
## 347 15.20055   -1.862  0.034

All values are below 1, however we have one bigger gap, therefore the values above gap have to be removed.

#Removing high influential units.
library(dplyr)
mydata2 <- mydata2 %>%
  filter(!ID==306,
         !ID==121,
         !ID==376,
         !ID==65,
         !ID==102,
         !ID==318,
         !ID==163,
         !ID==308,
         !ID==135)
#Rechecking for linearity and perfect correlation, after removing outliers and high influential units.
library(car)
scatterplotMatrix(mydata2[ , c(-1,-2, -6,-7,-9,-10)], 
                  smooth = FALSE) 

#Scatterplot of standardized residuals and standardized fitted values - checking for homoskedasticity.
#Making another fit, due to exclusion of multiple outliers and high influential units.

fit2 <- lm(lnPrice~ BEDS+BATH+PROPERTYSQFT+typeF,
           data = mydata2)

mydata2$StdFitted <- scale(fit2$fitted.values)
#Checking for heteroskedasticity
library(car)
scatterplot(y=mydata2$StdResid, x=mydata2$StdFitted,
            ylab= "Standardized residuals",
            xlab= "Standardized fitted values",
            boxplots= FALSE,
            regLine = FALSE,
            smooth= FALSE)

#Breusch-Pagan test for heteroskedasticity.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                 
##  -----------------------------------
##  Response : lnPrice 
##  Variables: fitted values of lnPrice 
## 
##          Test Summary          
##  ------------------------------
##  DF            =    1 
##  Chi2          =    8.400273 
##  Prob > Chi2   =    0.003751646
Interpretation
  • H0: Homoskedasticity is met.
  • H1: Homoskedasticity is not met.

We reject H0 at p-value= 0.004. We have a heteroskedastic model. Have to use robust errors.

library(estimatr)
#Trasformig fit to include robust errors.
fit_robust1<- lm_robust(lnPrice~ BEDS+BATH+PROPERTYSQFT+typeF,
                        se_type="HC1",
                        data = mydata2)

#Results of regression model.
summary(fit_robust1)
## 
## Call:
## lm_robust(formula = lnPrice ~ BEDS + BATH + PROPERTYSQFT + typeF, 
##     data = mydata2, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                                   Estimate Std. Error  t value   Pr(>|t|)   CI Lower   CI Upper  DF
## (Intercept)                     12.1793736  9.066e-02 134.3447 1.274e-300 12.0010642 12.3576830 346
## BEDS                            -0.0872997  2.794e-02  -3.1246  1.931e-03 -0.1422531 -0.0323463 346
## BATH                             0.3340432  4.259e-02   7.8431  5.514e-14  0.2502737  0.4178128 346
## PROPERTYSQFT                     0.0003436  4.367e-05   7.8688  4.632e-14  0.0002577  0.0004295 346
## typeFCondo for sale              0.8416410  1.142e-01   7.3671  1.290e-12  0.6169415  1.0663405 346
## typeFHouse for sale              0.3533896  1.086e-01   3.2542  1.249e-03  0.1398028  0.5669764 346
## typeFLand for sale              -0.1212493  2.028e-01  -0.5978  5.503e-01 -0.5201525  0.2776540 346
## typeFMulti-family home for sale  0.3099613  1.293e-01   2.3977  1.703e-02  0.0556990  0.5642237 346
## typeFTownhouse for sale          0.5755176  1.786e-01   3.2230  1.389e-03  0.2243056  0.9267297 346
## 
## Multiple R-squared:  0.6021 ,    Adjusted R-squared:  0.5929 
## F-statistic: 49.22 on 8 and 346 DF,  p-value: < 2.2e-16
#Coefficient of multiple correlation
sqrt(summary(fit_robust1)$r.squared)
## [1] 0.7759453
  • Coefficient of multiple correlation (r) is 0.78, there is a strong and linear correlation.
Interpretation

From the summary we can see:

  • All variables are statistically significant.

  • H0: b1=0

  • H1: b1=/0

If the apartment has an additional bed, the price will on average decrease by 8.7 % (at p-value<0.001), assuming everything else remains the same.

  • H0: b2=0

  • H1: b2=/0

If the apartment has an additional bath, the price will on average increase by 33.4 %, (at p-value<0.001), ssuming everything else remains the same. (39.65 if we do e^b3, but at small values of b’s that is not needed)

  • H0: b3=0

  • H1: b3=/0

If the size of an apartment increases by 1 square meter, the price will on average increase by 0.034 %,(at p-value<0.001)ssuming everything else remains the same.

  • H0: b4=0

  • H1: b4=/0

Given all other variables, a Condo has on average a price higher by 130.5 % compared to a Co-op, (p-value<0.001).

  • H0: b5=0

  • H1: b5=/0 Given all other variables, a House has on average a price higher by 41.1 % compared to a Co-op, (p-value<0.001).

  • H0: b6=0

  • H1: b6=/0

Given all other variables, a Land has on average a price lower by 13 % compared to a Co-op, (p-value<0.001).

  • H0: b7=0

  • H1: b7=/0

Given all other variables, a Multi family home has on average a price higher by 35.2 % compared to a Co-op, (p-value<0.001).

  • H0: b8=0

  • H1: b8=/0

Given all other variables, a Townhouse has on average a price higher by 75.8 % compared to a Co-op, (p-value<0.001).

  • Multiple Multiple R^2: 60.21 % of variability of price of an apartment can be explained with variability of number of beds and baths, property square footage and the type of apartment.

  • F-Satistics

  • HO: Rho^2=0

  • H1: Rho^2=/0 We reject H0 at p-value < 0.001.

Conclusion

Based on sample data we can conclude that there is a strong linear correlation between the dependent variable - Price and explanatory variables in our model. Compared to Co-op (reference group), all other types of apatments are on average more expensive, only Land is cheaper. All explanatory variables are significant, therefore have an effect on the price changes of apartments.