Real-Estate

A glimpse through R

Arpan Dutta
Soumyajit Roy
Sourav Biswas

Packages

Used packages
library(ggplot2)
library(Matrix)

Introduction to the Data

Data : Real-Estate

Variable         Description
Price            Sales price of residence (dollars)
Sqft             Finished area of residence (square feet)
Bedroom          Total number of bedrooms in residence
Bathroom         Total number of bathrooms in residence
Airconditioning  \(1\) = presence of air conditioning, \(0\) = otherwise
Garage           Number of cars that the garage will hold
Pool             \(1\) = presence of a pool, \(0\) = otherwise
YearBuild        Year of construction
Quality          \(1\) = high quality, \(2\) = medium, \(3\) = low
Lot              Lot size (square feet)
AdjHighway       \(1\) = property is adjacent to a highway, \(0\) = otherwise

Questions we are interested in…

  • Do all the variables have a significant effect on Price?

  • Do older houses tend to have lower prices?

  • How much does adjacency to a highway affect the price?

  • Does the average price differ between air-conditioned and non-air-conditioned houses?

Structure of the Data

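# Read the data (the path is specific to the author's machine)
real<-read.csv("C:\\users\\arpan\\Documents\\data\\real-estate.csv")
real<-real[,-1] # drop the ID column
# Treat Airconditioning, Pool, Quality and AdjHighway (columns 5, 7, 9, 11) as factors
for(i in c(5,7,9,11))real[,i]<-as.factor(real[,i])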
str(real)
'data.frame':   522 obs. of  11 variables:
 $ Price          : int  360000 340000 250000 205500 275500 248000 229900 150000 195000 160000 ...
 $ Sqft           : int  3032 2058 1780 1638 2196 1966 2216 1597 1622 1976 ...
 $ Bedroom        : int  4 4 4 4 4 4 3 2 3 3 ...
 $ Bathroom       : int  4 2 3 2 3 3 2 1 2 3 ...
 $ Airconditioning: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 1 ...
 $ Garage         : int  2 2 2 2 2 5 2 1 2 1 ...
 $ Pool           : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
 $ YearBuild      : int  1972 1976 1980 1963 1968 1972 1972 1955 1975 1918 ...
 $ Quality        : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 3 3 ...
 $ Lot            : int  22221 22912 21345 17342 21786 18902 18639 22112 14321 32358 ...
 $ AdjHighway     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
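
The loop above selects the categorical columns by position, which silently breaks if the column order changes. A minimal sketch of the same conversion by column name is given below; the data frame `toy` and its values are made up for illustration and are not the real-estate data.

```r
# Hypothetical toy data with coded categorical columns
toy <- data.frame(
  Price = c(360000, 340000, 250000, 205500),
  Airconditioning = c(1, 1, 0, 1),
  Quality = c(2, 2, 3, 1)
)

# Convert the coded columns to factors by name rather than by position
cat_cols <- c("Airconditioning", "Quality")
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

str(toy)
```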

Modelling

Fitting a Linear Model

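We first fit the linear model with all the variables, \[\boldsymbol{y}=X_{n\times p}\boldsymbol{\beta}+\boldsymbol{\epsilon}\] where \(n=522,\:p=12\),

\[\boldsymbol{\beta}=\left(\alpha,\beta_{1},\beta_{2},\ldots,\beta_{11}\right)'\] \[X=\left(\boldsymbol{1_{n},x_{1},x_{2},\ldots,x_{11}}\right)\] assuming \[\epsilon_{i}\overset{iid}{\sim}\mathcal{N}\left(0,\sigma^{2}\right),\quad i=1,\ldots,n\]

Here \(\boldsymbol{x_{1}},\ldots,\boldsymbol{x_{11}}\) are the columns of the design matrix: the three-level factor Quality contributes two dummy columns, so the 10 predictors give 11 columns.

Since the model combines quantitative covariates with categorical factors, it is an ANCOVA (analysis of covariance) model.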

Finding OLS

\[\boldsymbol{\hat{\beta}_{OLS}}=\left(X'X\right)^{-1}X'\boldsymbol{y}\]
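
For completeness, a short sketch of where this formula comes from, assuming \(X\) has full column rank so that \(X'X\) is invertible. The OLS estimate minimises the residual sum of squares:

\[S(\boldsymbol{\beta})=\left(\boldsymbol{y}-X\boldsymbol{\beta}\right)'\left(\boldsymbol{y}-X\boldsymbol{\beta}\right)=\boldsymbol{y}'\boldsymbol{y}-2\boldsymbol{\beta}'X'\boldsymbol{y}+\boldsymbol{\beta}'X'X\boldsymbol{\beta}\]

Setting the gradient to zero gives the normal equations:

\[\frac{\partial S}{\partial\boldsymbol{\beta}}=-2X'\boldsymbol{y}+2X'X\boldsymbol{\beta}=\boldsymbol{0}\quad\Longrightarrow\quad X'X\boldsymbol{\hat{\beta}}=X'\boldsymbol{y}\quad\Longrightarrow\quad\boldsymbol{\hat{\beta}}=\left(X'X\right)^{-1}X'\boldsymbol{y}\]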

OLS of Beta
fm = lm(Price~.,data = real)
x<-model.matrix(fm)
xtx<-crossprod(x)
beta.cap<-solve(xtx)%*%t(x)%*%(real$Price)
round(beta.cap,4)
                          [,1]
(Intercept)      -2358196.4156
Sqft                   87.0047
Bedroom             -5125.0967
Bathroom             8126.9009
Airconditioning1     4850.7151
Garage              10888.3678
Pool1               10138.7609
YearBuild            1269.4213
Quality2          -142985.0702
Quality3          -148375.5019
Lot                     1.5565
AdjHighway1        -27373.9498
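
As a sanity check on the matrix formula, the sketch below compares it with `lm()` on simulated data. The data frame `sim` and its coefficients are made up for illustration. It solves the normal equations without forming the inverse explicitly, and also uses a QR decomposition, which is the approach `lm()` itself relies on.

```r
set.seed(1)
n <- 50
# Simulated covariates and response (illustrative only)
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 2 + 3 * sim$x1 - 1 * sim$x2 + rnorm(n)

fit <- lm(y ~ ., data = sim)
X <- model.matrix(fit)

# Normal equations: solve (X'X) b = X'y without inverting X'X
b_normal <- solve(crossprod(X), crossprod(X, sim$y))

# Least squares via the QR decomposition of X
b_qr <- qr.coef(qr(X), sim$y)

# Both should agree with lm() up to floating-point error
all.equal(as.vector(b_normal), unname(coef(fit)))
all.equal(unname(b_qr), unname(coef(fit)))
```

Solving the linear system directly is generally preferable to computing `solve(xtx)` first, since explicit inversion is less stable when the columns of the design matrix are on very different scales.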

Model Summary

summary(fm)

Call:
lm(formula = Price ~ ., data = real)

Residuals:
    Min      1Q  Median      3Q     Max 
-204865  -28010   -4973   21315  298892 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -2.358e+06  3.991e+05  -5.909 6.29e-09 ***
Sqft              8.700e+01  6.570e+00  13.242  < 2e-16 ***
Bedroom          -5.125e+03  3.275e+03  -1.565   0.1182    
Bathroom          8.127e+03  4.288e+03   1.895   0.0586 .  
Airconditioning1  4.851e+03  8.086e+03   0.600   0.5488    
Garage            1.089e+04  5.060e+03   2.152   0.0319 *  
Pool1             1.014e+04  1.040e+04   0.975   0.3303    
YearBuild         1.269e+03  2.024e+02   6.272 7.60e-10 ***
Quality2         -1.430e+05  1.021e+04 -14.007  < 2e-16 ***
Quality3         -1.484e+05  1.404e+04 -10.564  < 2e-16 ***
Lot               1.556e+00  2.363e-01   6.587 1.12e-10 ***
AdjHighway1      -2.737e+04  1.810e+04  -1.512   0.1311    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 58770 on 510 degrees of freedom
Multiple R-squared:  0.8223,    Adjusted R-squared:  0.8184 
F-statistic: 214.5 on 11 and 510 DF,  p-value: < 2.2e-16
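
Each t value in the table is the estimate divided by its standard error (for Sqft, 8.700e+01 / 6.570e+00 ≈ 13.24), and the p-value is the two-sided tail probability of a t distribution with 510 = 522 − 12 degrees of freedom. The sketch below recomputes both columns by hand on simulated data; `sim` is made up for illustration.

```r
set.seed(2)
n <- 60
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n)   # x2 has no true effect

fit <- lm(y ~ ., data = sim)
tab <- coef(summary(fit))            # Estimate, Std. Error, t value, Pr(>|t|)

# Recompute the last two columns from the first two
t_val <- tab[, "Estimate"] / tab[, "Std. Error"]
p_val <- 2 * pt(-abs(t_val), df = df.residual(fit))

all.equal(t_val, tab[, "t value"])
all.equal(p_val, tab[, "Pr(>|t|)"])

# Terms significant at the 5% level
rownames(tab)[tab[, "Pr(>|t|)"] < 0.05]
```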

Price vs Sqft

ggplot(data=real,aes(x=Sqft,y=Price))+geom_point(col='tomato')

Price vs AirConditioning

ggplot(data=real,aes(x=Airconditioning,y=Price,colour=Airconditioning))+
geom_boxplot()+ labs(caption='Data=real-estate')
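
One possible way to put a number on the difference seen in the boxplot is a Welch two-sample t-test. The sketch below runs it on simulated prices; `toy` and its group means are made up for illustration. Note that this is an unadjusted comparison, unlike the Airconditioning1 coefficient in the model, which holds the other variables fixed.

```r
set.seed(3)
# Simulated prices for two groups (illustrative values only)
toy <- data.frame(
  Airconditioning = factor(rep(c("0", "1"), times = c(40, 60))),
  Price = c(rnorm(40, mean = 200000, sd = 50000),
            rnorm(60, mean = 260000, sd = 60000))
)

# Welch two-sample t-test of equal mean Price in the two groups
tt <- t.test(Price ~ Airconditioning, data = toy)
tt$estimate   # group means
tt$p.value    # two-sided p-value
```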

Do older houses tend to have lower prices?

ggplot(data=real,aes(x=YearBuild,y=Price))+
geom_point(col='tomato')+ labs(caption='Data=real-estate')
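
The fitted model gives a numerical counterpart to this plot. As a small worked example, the sketch below takes the YearBuild estimate and standard error from the model output and computes the expected price difference for a house built 10 years later with all other variables held fixed.

```r
# Estimate and standard error of YearBuild, copied from the fitted model output
est <- 1269.4213
se  <- 2.024e+02

# Expected price difference (dollars) for a house built 10 years later
est * 10

# t statistic for H0: coefficient = 0
est / se
```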

Price vs Sqft w.r.t. Quality

#---plot | Quality
ggplot(data=real,aes(x=Sqft,y=Price,colour=Quality))+
geom_point()+ labs(caption='Data=real-estate',title='Scatterplot of Price against Sqft by Quality')

How much adjacency to highway affects the price?

ggplot(data=real,aes(x=AdjHighway,y=Price,colour=AdjHighway))+
geom_boxplot()
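
To quantify the effect, one can build a 95% confidence interval from the AdjHighway1 row of the model summary. The sketch below uses the estimate, standard error and residual degrees of freedom reported there.

```r
# Values copied from the AdjHighway1 row of the model summary
est <- -27373.9498
se  <- 1.810e+04
df  <- 510                      # residual degrees of freedom

# 95% confidence interval: estimate +/- t quantile * standard error
ci <- est + c(-1, 1) * qt(0.975, df) * se
ci

# The interval contains 0, in line with the reported p-value of 0.1311
ci[1] < 0 && ci[2] > 0
```

The point estimate suggests a lower price for properties adjacent to a highway, but the interval is wide enough that the data do not rule out no effect at the 5% level.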

Inspecting the underlying assumptions

Model Assumptions of the Full Model

  • Errors have mean zero.

  • Errors are homoscedastic (constant variance).

  • Errors are uncorrelated.

  • Errors are normally distributed.
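
Alongside the plots, some of these assumptions can be checked numerically. The sketch below is one possible set of base R checks on simulated data; `sim` is made up for illustration, and the Durbin-Watson statistic is only meaningful when the observations have a natural order.

```r
set.seed(4)
n <- 100
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- 5 + 2 * sim$x + rnorm(n)    # simulated data satisfying the assumptions

fit <- lm(y ~ x, data = sim)
e <- resid(fit)

# Mean zero: holds by construction for OLS residuals when an intercept is fitted
mean(e)

# Uncorrelated errors: Durbin-Watson statistic, values near 2 suggest
# no lag-1 autocorrelation
dw <- sum(diff(e)^2) / sum(e^2)
dw

# Normality: Shapiro-Wilk test on the residuals
sw <- shapiro.test(e)
sw$p.value
```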

Plotting Residuals

residuals<-residuals(fm)
fitted<-fitted.values(fm)
ggobj=ggplot()+geom_point(aes(x=1:length(residuals),y=residuals),col='tomato')+
  geom_hline(yintercept = 0,col='navy')+labs(x="Index",y="Residuals")
ggobj

Homoscedastic?

ggobj=ggplot()+geom_point(aes(x=1:length(residuals),y=residuals^2),col='tomato')+
  geom_hline(yintercept = 0,col='navy')+labs(x="Index",y=expression(e^2))
ggobj
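
A formal counterpart to this plot is the studentized Breusch-Pagan test, which regresses the squared residuals on the regressors and uses \(nR^{2}\) of that auxiliary regression as a chi-square statistic. The sketch below implements it in base R on simulated heteroscedastic data; `sim` is made up for illustration, and this test is offered as one option rather than as part of the original analysis.

```r
set.seed(5)
n <- 200
sim <- data.frame(x = runif(n, 1, 10))
# Simulated heteroscedastic errors: the spread grows with x
sim$y <- 5 + 2 * sim$x + rnorm(n, sd = 0.5 * sim$x)

fit <- lm(y ~ x, data = sim)
sim$e2 <- resid(fit)^2

# Auxiliary regression of squared residuals on the regressor
aux <- lm(e2 ~ x, data = sim)

# Studentized Breusch-Pagan statistic: n * R^2, chi-square with 1 df here
# because the auxiliary regression has one regressor
bp <- n * summary(aux)$r.squared
p_value <- pchisq(bp, df = 1, lower.tail = FALSE)
c(statistic = bp, p.value = p_value)
```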

Normality Checking

ggplot()+stat_qq(aes(sample=residuals),col='tomato')+
  stat_qq_line(aes(sample=residuals),col='navy')+labs(y="Sample Quantiles",x="Theoretical Quantiles",title="Normal QQPlot of Residuals")

Fitted vs. Response

ggplot(data=real,aes(x=Price,y=fitted))+geom_point(col='tomato')+
  geom_abline(intercept=0,slope=1,col='navy')+
  labs(x="Observed Values",y="Fitted Values",title="Scatterplot of Fitted values against Response")

Thank You.