Overview

This report has two tasks: 1. Perform an exploratory analysis of the dataset. 2. Build models to predict the close price, with and without the list price as a predictor.

Exploratory Analysis

The data was downloaded online. It includes property features and prices. Here is a brief summary of the data.

##    ListingId         LivingArea       NumBedrooms        NumBaths       
##  Min.   :5077399   Min.   :      0   Min.   : 0.000   Min.   :   0.000  
##  1st Qu.:5098855   1st Qu.:   1427   1st Qu.: 3.000   1st Qu.:   2.000  
##  Median :5120464   Median :   1837   Median : 3.000   Median :   2.000  
##  Mean   :5120968   Mean   :   2240   Mean   : 3.292   Mean   :   2.387  
##  3rd Qu.:5143012   3rd Qu.:   2446   3rd Qu.: 4.000   3rd Qu.:   2.500  
##  Max.   :5178286   Max.   :9999999   Max.   :16.000   Max.   :1047.000  
##                    NA's   :209                                          
##                        Pool       ExteriorStories     ListDate         
##  Both Private & Community: 2260   Min.   : 1.000   Min.   :2014-03-01  
##  Community               :14249   1st Qu.: 1.000   1st Qu.:2014-04-10  
##  None                    :26700   Median : 1.000   Median :2014-05-23  
##  Private                 :14049   Mean   : 1.374   Mean   :2014-05-25  
##                                   3rd Qu.: 2.000   3rd Qu.:2014-07-10  
##                                   Max.   :23.000   Max.   :2014-08-31  
##                                                                        
##    ListPrice            GeoLat          GeoLon          CloseDate         
##  Min.   :     750   Min.   :30.21   Min.   :-151.06   Min.   :2014-03-02  
##  1st Qu.:  149900   1st Qu.:33.36   1st Qu.:-112.20   1st Qu.:2014-06-25  
##  Median :  215000   Median :33.49   Median :-111.99   Median :2014-08-18  
##  Mean   :  289744   Mean   :33.51   Mean   :-111.99   Mean   :2014-08-22  
##  3rd Qu.:  325000   3rd Qu.:33.62   3rd Qu.:-111.79   3rd Qu.:2014-10-09  
##  Max.   :16950000   Max.   :60.48   Max.   : -95.46   Max.   :2015-03-31  
##                     NA's   :99      NA's   :99        NA's   :19324       
##    ClosePrice              ListingStatus  
##  Min.   :      1   Active         : 2525  
##  1st Qu.: 140000   Cancelled      :11284  
##  Median : 195000   Closed         :37606  
##  Mean   : 240641   Expired        : 4750  
##  3rd Qu.: 280475   Pending        :  770  
##  Max.   :5500000   Temp Off Market:  323  
##  NA's   :19324                            
##                    DwellingType  
##  Single Family - Detached:48054  
##  Townhouse               : 3813  
##  Apartment Style/Flat    : 2866  
##  Mfg/Mobile Housing      : 1363  
##  Patio Home              :  721  
##  Gemini/Twin Home        :  336  
##  (Other)                 :  105
## Warning: Removed 19324 rows containing missing values (geom_point).

Several columns contain missing values (NAs). We handle them as follows. For LivingArea, we fill NAs with the median (1837). For GeoLat and GeoLon, we fill NAs with the mean; note that the code below uses -111.79 for GeoLon, which is the third quartile in the summary above rather than the mean (-111.99). CloseDate and ClosePrice are our outcomes, so we leave their NAs in place at this step; to build the model, we use only the rows that have a CloseDate and ClosePrice. The resulting dataset, with no NAs in the predictors, is named ncData.

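# Copy the raw data, then fill NAs with values read from the summary above
ncData <- cData
ncData[is.na(ncData$LivingArea),"LivingArea"] <- 1837  # median LivingArea
ncData[is.na(ncData$GeoLat),"GeoLat"] <- 33.51         # mean GeoLat
# Note: -111.79 is the 3rd quartile of GeoLon in the summary; the mean is -111.99
ncData[is.na(ncData$GeoLon),"GeoLon"] <- -111.79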
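Hard-coded fill values are easy to mistype, so one alternative is to compute them from the data. The following is a minimal sketch in base R, not the code used for the results in this report. The `toy` data frame and the `fill_na` helper are hypothetical names introduced only for illustration, and the values are made up.

```r
# Toy data frame standing in for cData (values are made up for illustration).
toy <- data.frame(
  LivingArea = c(1200, NA, 1800, 2400),
  GeoLat     = c(33.4, 33.6, NA, 33.5),
  GeoLon     = c(-112.1, NA, -111.9, -112.0)
)

# Hypothetical helper: replace NAs in a numeric vector with a summary
# statistic (median by default) computed from the non-missing values.
fill_na <- function(x, stat = median) {
  x[is.na(x)] <- stat(x, na.rm = TRUE)
  x
}

filled <- toy
filled$LivingArea <- fill_na(toy$LivingArea, median)  # median for area
filled$GeoLat     <- fill_na(toy$GeoLat, mean)        # mean for latitude
filled$GeoLon     <- fill_na(toy$GeoLon, mean)        # mean for longitude
```

Computing the statistic inside the helper keeps the fill value in sync with the data if the dataset is refreshed.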

The summary below confirms that the predictors no longer contain NAs; only CloseDate and ClosePrice still do.

##    ListingId         LivingArea       NumBedrooms        NumBaths       
##  Min.   :5077399   Min.   :      0   Min.   : 0.000   Min.   :   0.000  
##  1st Qu.:5098855   1st Qu.:   1428   1st Qu.: 3.000   1st Qu.:   2.000  
##  Median :5120464   Median :   1837   Median : 3.000   Median :   2.000  
##  Mean   :5120968   Mean   :   2238   Mean   : 3.292   Mean   :   2.387  
##  3rd Qu.:5143012   3rd Qu.:   2443   3rd Qu.: 4.000   3rd Qu.:   2.500  
##  Max.   :5178286   Max.   :9999999   Max.   :16.000   Max.   :1047.000  
##                                                                         
##                        Pool       ExteriorStories     ListDate         
##  Both Private & Community: 2260   Min.   : 1.000   Min.   :2014-03-01  
##  Community               :14249   1st Qu.: 1.000   1st Qu.:2014-04-10  
##  None                    :26700   Median : 1.000   Median :2014-05-23  
##  Private                 :14049   Mean   : 1.374   Mean   :2014-05-25  
##                                   3rd Qu.: 2.000   3rd Qu.:2014-07-10  
##                                   Max.   :23.000   Max.   :2014-08-31  
##                                                                        
##    ListPrice            GeoLat          GeoLon          CloseDate         
##  Min.   :     750   Min.   :30.21   Min.   :-151.06   Min.   :2014-03-02  
##  1st Qu.:  149900   1st Qu.:33.36   1st Qu.:-112.20   1st Qu.:2014-06-25  
##  Median :  215000   Median :33.49   Median :-111.99   Median :2014-08-18  
##  Mean   :  289744   Mean   :33.51   Mean   :-111.98   Mean   :2014-08-22  
##  3rd Qu.:  325000   3rd Qu.:33.62   3rd Qu.:-111.79   3rd Qu.:2014-10-09  
##  Max.   :16950000   Max.   :60.48   Max.   : -95.46   Max.   :2015-03-31  
##                                                       NA's   :19324       
##    ClosePrice              ListingStatus  
##  Min.   :      1   Active         : 2525  
##  1st Qu.: 140000   Cancelled      :11284  
##  Median : 195000   Closed         :37606  
##  Mean   : 240641   Expired        : 4750  
##  3rd Qu.: 280475   Pending        :  770  
##  Max.   :5500000   Temp Off Market:  323  
##  NA's   :19324                            
##                    DwellingType  
##  Single Family - Detached:48054  
##  Townhouse               : 3813  
##  Apartment Style/Flat    : 2866  
##  Mfg/Mobile Housing      : 1363  
##  Patio Home              :  721  
##  Gemini/Twin Home        :  336  
##  (Other)                 :  105

Separate Data

We treat the rows without a CloseDate and ClosePrice as the testing set and the remaining rows as the training set.

training <- ncData[!is.na(ncData$CloseDate),]
testing <- ncData[is.na(ncData$CloseDate),]

Then we split the training set further into a training set (80%) and a validation set (20%).

library(caret)
## Loading required package: lattice
inTrain <- createDataPartition(y=training$ClosePrice,p=0.8,list=FALSE)
Training <- training[inTrain,] 
Validation <- training[-inTrain,]
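Because the partition is drawn at random, the validation results change from run to run unless a seed is set first. The following is a minimal base R sketch of a reproducible 80/20 split. It uses simple random sampling, whereas caret's `createDataPartition` additionally tries to balance the split across the outcome distribution. The seed value and the `toy` data frame are arbitrary assumptions for illustration.

```r
set.seed(123)  # arbitrary seed, chosen only for illustration

# Toy stand-in for the `training` data frame (made-up values).
toy <- data.frame(
  id         = 1:100,
  ClosePrice = seq(100000, by = 1000, length.out = 100)
)

# Draw 80% of the row indices without replacement.
in_train <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))

toy_training   <- toy[in_train, ]   # rows used to fit the model
toy_validation <- toy[-in_train, ]  # held-out rows used to check it
```

Calling `set.seed()` before `createDataPartition` has the same effect of making the caret split repeatable.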

Build a Model considering list date as close date with ListPrice.

If we treat the list date as the close date when predicting the close price, we need to rely on the list price and the features of the house. Since this is a regression problem, the algorithm we choose is the Generalized Linear Model (glm). The predictors are either numeric values or factors; glm encodes each factor as numeric indicator (dummy) variables so that it fits into the linear model. After building the model, we use a paired t-test to compare its predictions with the actual close prices in the validation set.
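To illustrate how a factor predictor enters a linear model, the sketch below builds the design matrix for a tiny made-up dataset with base R's `model.matrix`, which is the same encoding step `glm` applies internally. The `toy` data frame is an assumption for illustration; only the column names mirror the real data.

```r
# Toy data with one numeric predictor and one factor predictor.
toy <- data.frame(
  LivingArea = c(1500, 2000, 1200, 1800),
  Pool       = factor(c("None", "Private", "Community", "None"))
)

# Build the design matrix: the first factor level (alphabetically,
# "Community") becomes the reference level, and every other level
# gets its own 0/1 indicator column.
X <- model.matrix(~ LivingArea + Pool, data = toy)

print(colnames(X))
print(X)
```

Each indicator column receives its own coefficient, interpreted as the difference from the reference level.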

modelFit <- train(ClosePrice~LivingArea+NumBedrooms+NumBaths+Pool+ExteriorStories+ListPrice+GeoLat+GeoLon+DwellingType,method="glm",data=Training)
prediction <- predict(modelFit,Validation)
qplot(Validation$ClosePrice,prediction,main="ClosePrice Prediction Performance",xlab="ClosePrice",ylab="Prediction",color="house")

t.test(prediction,Validation$ClosePrice,paired=TRUE,var.equal=TRUE)
## 
##  Paired t-test
## 
## data:  prediction and Validation$ClosePrice
## t = -0.28747, df = 7583, p-value = 0.7738
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -403.9180  300.6029
## sample estimates:
## mean of the differences 
##               -51.65751

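The 95% confidence interval for the mean difference (-403.92 to 300.60) contains 0, and the p-value is 0.7738, so the predictions show no evidence of systematic bias on the validation set. This suggests the model is a reasonably good fit, although the t-test measures only the average error, not the error for individual houses. To predict the close price for the testing set, we apply this model and save the output as a new dataset, predictClose1.csv.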
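A paired t-test on predictions versus actual prices checks only whether the average difference is zero, so it can be useful to report an error-magnitude metric alongside it. The sketch below uses simulated data, not the report's data, to compare two sets of predictions that are both unbiased by construction but differ greatly in accuracy. The `rmse` and `mae` helpers, the seed, and all simulated values are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed, for repeatability of the simulation

# Simulated "actual" prices.
actual <- runif(1000, min = 100000, max = 500000)

# Two sets of simulated predictions with zero-mean errors:
# one with small errors, one with large errors.
pred_tight <- actual + rnorm(1000, mean = 0, sd = 5000)
pred_loose <- actual + rnorm(1000, mean = 0, sd = 80000)

# Error-magnitude metrics (hypothetical helper names).
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
mae  <- function(pred, obs) mean(abs(pred - obs))

# The paired t-test looks only at the mean of the differences.
p_tight <- t.test(pred_tight, actual, paired = TRUE)$p.value
p_loose <- t.test(pred_loose, actual, paired = TRUE)$p.value

rmse_tight <- rmse(pred_tight, actual)
rmse_loose <- rmse(pred_loose, actual)

print(c(p_tight = p_tight, p_loose = p_loose))
print(c(rmse_tight = rmse_tight, rmse_loose = rmse_loose,
        mae_tight = mae(pred_tight, actual),
        mae_loose = mae(pred_loose, actual)))
```

On the report's data, the same helpers could be applied to `prediction` and `Validation$ClosePrice` to quantify how far individual predictions are from the actual prices.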

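# Predict close prices for the listings that have not closed yet
ClosePrice <- predict(modelFit,testing)
# Drop columns 11-13 (CloseDate, ClosePrice, ListingStatus, going by the column order in the summary) and attach the predictions
result <- cbind(testing[,-c(11,12,13)],ClosePrice)
write.csv(result,file="~/Downloads/predictClose1.csv")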

Build a Model considering list date as close date without ListPrice.

We can repeat the model-building process, excluding ListPrice as a predictor.

modelFit1 <- train(ClosePrice~LivingArea+NumBedrooms+NumBaths+Pool+ExteriorStories+GeoLat+GeoLon+DwellingType,method="glm",data=Training)
prediction1 <- predict(modelFit1,Validation)
qplot(Validation$ClosePrice,prediction1,main="ClosePrice Prediction Performance",xlab="ClosePrice",ylab="Prediction",color="house")

t.test(prediction1,Validation$ClosePrice,paired=TRUE,var.equal=TRUE)
## 
##  Paired t-test
## 
## data:  prediction1 and Validation$ClosePrice
## t = 0.083547, df = 7583, p-value = 0.9334
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4507.376  4908.690
## sample estimates:
## mean of the differences 
##                 200.657

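In the scatter plot, the performance looks pretty bad. In the t-test, however, the 95% confidence interval for the mean difference (-4507.38 to 4908.69) contains 0 and the p-value is 0.9334, so the predictions again show no evidence of systematic bias. The interval is far wider than for the model with ListPrice, which reflects much larger individual errors. We can say this model predicts ClosePrice without bias on average, but it is not very robust.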

Possible improvements

  1. To improve the model, we should also consider the listing status, post date, and days on market as predictors, and use NLP to extract useful information from the PublicRemarks field.

  2. To build a data product that predicts home values in real time in a production environment, we need to initialize the model for each house using all of the house features and the list price, and also take the current date as an input so that the predicted price can be updated every few days.
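As one illustration of the days-on-market idea, the sketch below derives such a feature from ListDate and CloseDate, using a scoring date in place of CloseDate for listings that have not closed. The `toy` data frame, the `DaysOnMarket` column name, and the `today` value are hypothetical and chosen only for illustration.

```r
# Toy listings: the third one has not closed yet.
toy <- data.frame(
  ListDate  = as.Date(c("2014-03-01", "2014-04-10", "2014-05-23")),
  CloseDate = as.Date(c("2014-03-31", "2014-06-25", NA))
)

# Hypothetical scoring date; in production this would be Sys.Date().
today <- as.Date("2014-09-01")

# Use the close date when it exists, otherwise the scoring date.
end_date <- toy$CloseDate
end_date[is.na(end_date)] <- today

# Subtracting two Date vectors gives a difference in days.
toy$DaysOnMarket <- as.numeric(end_date - toy$ListDate)

print(toy)
```

Recomputing this column with the current date each time the model is scored is one way the predicted price could be refreshed every few days.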