This report has two tasks: (1) perform exploratory analysis of the dataset, and (2) build models that predict the close price, with and without the list price as a predictor.
The data was downloaded online and contains property features and prices. Here is a brief summary of the data.
## ListingId LivingArea NumBedrooms NumBaths
## Min. :5077399 Min. : 0 Min. : 0.000 Min. : 0.000
## 1st Qu.:5098855 1st Qu.: 1427 1st Qu.: 3.000 1st Qu.: 2.000
## Median :5120464 Median : 1837 Median : 3.000 Median : 2.000
## Mean :5120968 Mean : 2240 Mean : 3.292 Mean : 2.387
## 3rd Qu.:5143012 3rd Qu.: 2446 3rd Qu.: 4.000 3rd Qu.: 2.500
## Max. :5178286 Max. :9999999 Max. :16.000 Max. :1047.000
## NA's :209
## Pool ExteriorStories ListDate
## Both Private & Community: 2260 Min. : 1.000 Min. :2014-03-01
## Community :14249 1st Qu.: 1.000 1st Qu.:2014-04-10
## None :26700 Median : 1.000 Median :2014-05-23
## Private :14049 Mean : 1.374 Mean :2014-05-25
## 3rd Qu.: 2.000 3rd Qu.:2014-07-10
## Max. :23.000 Max. :2014-08-31
##
## ListPrice GeoLat GeoLon CloseDate
## Min. : 750 Min. :30.21 Min. :-151.06 Min. :2014-03-02
## 1st Qu.: 149900 1st Qu.:33.36 1st Qu.:-112.20 1st Qu.:2014-06-25
## Median : 215000 Median :33.49 Median :-111.99 Median :2014-08-18
## Mean : 289744 Mean :33.51 Mean :-111.99 Mean :2014-08-22
## 3rd Qu.: 325000 3rd Qu.:33.62 3rd Qu.:-111.79 3rd Qu.:2014-10-09
## Max. :16950000 Max. :60.48 Max. : -95.46 Max. :2015-03-31
## NA's :99 NA's :99 NA's :19324
## ClosePrice ListingStatus
## Min. : 1 Active : 2525
## 1st Qu.: 140000 Cancelled :11284
## Median : 195000 Closed :37606
## Mean : 240641 Expired : 4750
## 3rd Qu.: 280475 Pending : 770
## Max. :5500000 Temp Off Market: 323
## NA's :19324
## DwellingType
## Single Family - Detached:48054
## Townhouse : 3813
## Apartment Style/Flat : 2866
## Mfg/Mobile Housing : 1363
## Patio Home : 721
## Gemini/Twin Home : 336
## (Other) : 105
## Warning: Removed 19324 rows containing missing values (geom_point).
Several columns contain missing values (NAs). We handle them as follows: for LivingArea, we fill the NAs with the median; for GeoLat and GeoLon, we fill the NAs with the mean. CloseDate and ClosePrice are our outcomes, so we leave their NAs in place; to build the model, we use only the rows that have both a CloseDate and a ClosePrice. The resulting dataset is named ncData.
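ncData <- cData
# Fill missing values with the statistics reported in the summary above
ncData$LivingArea[is.na(ncData$LivingArea)] <- 1837   # median LivingArea
ncData$GeoLat[is.na(ncData$GeoLat)] <- 33.51          # mean GeoLat
ncData$GeoLon[is.na(ncData$GeoLon)] <- -111.99        # mean GeoLon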
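The fill values above are typed in by hand from the summary. As a minimal sketch of an alternative, the same statistics can be computed directly from the column being filled, which avoids copying numbers by hand. The toy data frame and the `fill_na` helper below are assumptions made for illustration; they are not part of the original analysis.

```r
# Toy data frame standing in for cData (values are made up for illustration)
toy <- data.frame(
  LivingArea = c(1200, NA, 1800, 2500, NA),
  GeoLat     = c(33.4, 33.6, NA, 33.5, 33.5),
  GeoLon     = c(-112.1, NA, -111.9, -112.0, -112.0)
)

# Hypothetical helper: replace the NAs in one numeric column
# with a summary statistic computed from the non-missing values
fill_na <- function(x, stat = median) {
  x[is.na(x)] <- stat(x, na.rm = TRUE)
  x
}

filled <- toy
filled$LivingArea <- fill_na(toy$LivingArea, median)  # median is less sensitive to outliers
filled$GeoLat     <- fill_na(toy$GeoLat, mean)
filled$GeoLon     <- fill_na(toy$GeoLon, mean)

colSums(is.na(filled))  # count of remaining NAs per column
```

The median is a sensible default for LivingArea here because the summary shows an extreme maximum (9999999) that would pull the mean upward.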
The summary below confirms that LivingArea, GeoLat, and GeoLon no longer contain NAs; the NAs that remain in CloseDate and ClosePrice are intentional.
## ListingId LivingArea NumBedrooms NumBaths
## Min. :5077399 Min. : 0 Min. : 0.000 Min. : 0.000
## 1st Qu.:5098855 1st Qu.: 1428 1st Qu.: 3.000 1st Qu.: 2.000
## Median :5120464 Median : 1837 Median : 3.000 Median : 2.000
## Mean :5120968 Mean : 2238 Mean : 3.292 Mean : 2.387
## 3rd Qu.:5143012 3rd Qu.: 2443 3rd Qu.: 4.000 3rd Qu.: 2.500
## Max. :5178286 Max. :9999999 Max. :16.000 Max. :1047.000
##
## Pool ExteriorStories ListDate
## Both Private & Community: 2260 Min. : 1.000 Min. :2014-03-01
## Community :14249 1st Qu.: 1.000 1st Qu.:2014-04-10
## None :26700 Median : 1.000 Median :2014-05-23
## Private :14049 Mean : 1.374 Mean :2014-05-25
## 3rd Qu.: 2.000 3rd Qu.:2014-07-10
## Max. :23.000 Max. :2014-08-31
##
## ListPrice GeoLat GeoLon CloseDate
## Min. : 750 Min. :30.21 Min. :-151.06 Min. :2014-03-02
## 1st Qu.: 149900 1st Qu.:33.36 1st Qu.:-112.20 1st Qu.:2014-06-25
## Median : 215000 Median :33.49 Median :-111.99 Median :2014-08-18
## Mean : 289744 Mean :33.51 Mean :-111.98 Mean :2014-08-22
## 3rd Qu.: 325000 3rd Qu.:33.62 3rd Qu.:-111.79 3rd Qu.:2014-10-09
## Max. :16950000 Max. :60.48 Max. : -95.46 Max. :2015-03-31
## NA's :19324
## ClosePrice ListingStatus
## Min. : 1 Active : 2525
## 1st Qu.: 140000 Cancelled :11284
## Median : 195000 Closed :37606
## Mean : 240641 Expired : 4750
## 3rd Qu.: 280475 Pending : 770
## Max. :5500000 Temp Off Market: 323
## NA's :19324
## DwellingType
## Single Family - Detached:48054
## Townhouse : 3813
## Apartment Style/Flat : 2866
## Mfg/Mobile Housing : 1363
## Patio Home : 721
## Gemini/Twin Home : 336
## (Other) : 105
We treat the rows without CloseDate and ClosePrice as the testing set and the remaining rows as the training set.
training <- ncData[!is.na(ncData$CloseDate),]
testing <- ncData[is.na(ncData$CloseDate),]
Then we separate the training data into a training set and a validation set.
library(caret)
## Loading required package: lattice
inTrain <- createDataPartition(y=training$ClosePrice,p=0.8,list=FALSE)
Training <- training[inTrain,]
Validation <- training[-inTrain,]
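Because the partition is random, each run produces a different split and slightly different results. The sketch below shows one way to make a split reproducible by fixing the random seed. It uses a plain random split in base R on made-up data, not caret's `createDataPartition`, and the seed value is arbitrary.

```r
set.seed(42)  # arbitrary seed, chosen only so the split can be reproduced

# Toy stand-in for the `training` data frame (made-up values)
toy <- data.frame(id = 1:100, ClosePrice = seq(100000, 595000, by = 5000))

# Simple random 80/20 split; caret's createDataPartition additionally
# balances the outcome distribution across the two parts
in_train  <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
toy_train <- toy[in_train, ]
toy_valid <- toy[-in_train, ]

nrow(toy_train)  # 80 rows
nrow(toy_valid)  # 20 rows
```

Calling `set.seed` immediately before `createDataPartition` should have the same effect on the caret-based split used in this report.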
If we treat the list date as the close date, the close price prediction has to rely on the list price and the features of the house. Because this is a regression problem, the algorithm I chose is the generalized linear model (glm). The predictors are either numeric values or factors; R's model formula encodes each factor as a set of indicator (dummy) variables so that it fits into the linear model. After building the model, we use a paired t-test to compare its predictions with the observed close prices in the validation set.
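To illustrate how a factor predictor enters a linear model, the sketch below builds the numeric design matrix for a small made-up data frame with base R's `model.matrix`. The toy values and the reduced set of Pool levels are assumptions for illustration only.

```r
# Toy listings (made-up values) with one factor predictor
toy <- data.frame(
  LivingArea = c(1500, 2200, 1800, 900),
  Pool = factor(c("None", "Private", "Community", "None"),
                levels = c("None", "Community", "Private"))
)

# model.matrix builds the numeric design matrix that lm/glm fit internally;
# the first factor level ("None") is the baseline and gets no column of its own
X <- model.matrix(~ LivingArea + Pool, data = toy)
colnames(X)
X
```

Each non-baseline level gets its own 0/1 column, so the fitted coefficient for a level is the estimated price difference relative to the baseline level.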
modelFit <- train(ClosePrice ~ LivingArea + NumBedrooms + NumBaths + Pool + ExteriorStories + ListPrice + GeoLat + GeoLon + DwellingType, method = "glm", data = Training)
prediction <- predict(modelFit, Validation)
qplot(Validation$ClosePrice, prediction, main = "ClosePrice Prediction Performance", xlab = "ClosePrice", ylab = "Prediction", color = "house")
# Paired t-test on the differences between predicted and observed close prices
t.test(prediction, Validation$ClosePrice, paired = TRUE)
##
## Paired t-test
##
## data: prediction and Validation$ClosePrice
## t = -0.28747, df = 7583, p-value = 0.7738
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -403.9180 300.6029
## sample estimates:
## mean of the differences
## -51.65751
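The 95% confidence interval for the mean difference (-403.92 to 300.60) contains 0 and the p-value is 0.7738, so the test finds no evidence that the predictions are systematically too high or too low. Together with the scatter plot, this suggests the model is a reasonably good fit, although the t-test checks only the average bias and not the size of the individual errors. To predict the close price for the testing set, we apply this model and save the output as predictClose1.csv.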
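A paired t-test only looks at the average difference between predictions and observed prices, so it helps to also look at the size of the individual errors. The sketch below uses made-up prices to show two sets of predictions that both have a mean difference of zero but very different accuracy. The `rmse` and `mae` helpers are hypothetical names defined here, not functions from the original analysis.

```r
# Made-up observed prices and two sets of predictions
actual     <- c(150000, 200000, 250000, 300000)
close_pred <- actual + c(-2000, 2000, -1000, 1000)       # small errors
far_pred   <- actual + c(-80000, 80000, -60000, 60000)   # large errors

# Hypothetical helpers for two common error metrics
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
mae  <- function(pred, obs) mean(abs(pred - obs))

# Both sets have a mean difference of zero, so a test on the
# mean difference cannot tell them apart
mean(close_pred - actual)
mean(far_pred - actual)

# The error metrics separate them clearly
c(rmse = rmse(close_pred, actual), mae = mae(close_pred, actual))
c(rmse = rmse(far_pred, actual),   mae = mae(far_pred, actual))
```

With caret loaded, `postResample(prediction, Validation$ClosePrice)` should report comparable metrics for the validation set in a single call.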
ClosePrice <- predict(modelFit, testing)
# Drop columns 11-13 (CloseDate, ClosePrice, ListingStatus) and append the predictions
result <- cbind(testing[, -c(11, 12, 13)], ClosePrice)
write.csv(result, file = "~/Downloads/predictClose1.csv")
We can repeat the model-building process with ListPrice excluded from the predictors.
modelFit1 <- train(ClosePrice ~ LivingArea + NumBedrooms + NumBaths + Pool + ExteriorStories + GeoLat + GeoLon + DwellingType, method = "glm", data = Training)
prediction1 <- predict(modelFit1, Validation)
qplot(Validation$ClosePrice, prediction1, main = "ClosePrice Prediction Performance", xlab = "ClosePrice", ylab = "Prediction", color = "house")
# Paired t-test on the differences between predicted and observed close prices
t.test(prediction1, Validation$ClosePrice, paired = TRUE)
##
## Paired t-test
##
## data: prediction1 and Validation$ClosePrice
## t = 0.083547, df = 7583, p-value = 0.9334
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4507.376 4908.690
## sample estimates:
## mean of the differences
## 200.657
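The scatter plot shows that this model performs much worse. The t-test again finds no systematic bias: the 95% confidence interval contains 0 and the p-value is 0.9334. However, the interval (-4507 to 4909) is roughly 13 times wider than the one for the model with ListPrice, which means the individual prediction errors are far larger. The model is unbiased on average, but without ListPrice it is not accurate enough to be considered robust.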
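The two t-test outputs also allow a rough comparison of error size. The sketch below backs out the standard deviation of the paired differences from each reported confidence interval, using the relation half-width = t critical value × sd / sqrt(n). The interval bounds and degrees of freedom are copied from the outputs in this report; `sd_from_ci` is a hypothetical helper.

```r
# Values copied from the two t-test outputs in this report
n          <- 7583 + 1                  # degrees of freedom + 1 paired observations
ci_with    <- c(-403.9180, 300.6029)    # model with ListPrice
ci_without <- c(-4507.376, 4908.690)    # model without ListPrice

# Hypothetical helper: recover the standard deviation of the paired
# differences from a 95% confidence interval for their mean
sd_from_ci <- function(ci, n) {
  half_width <- diff(ci) / 2
  std_error  <- half_width / qt(0.975, df = n - 1)
  std_error * sqrt(n)
}

sd_with    <- sd_from_ci(ci_with, n)
sd_without <- sd_from_ci(ci_without, n)

sd_with
sd_without
sd_without / sd_with   # how much more the second model's errors spread
```

Under these assumptions the spread of the errors is on the order of $15,000 with ListPrice and on the order of $200,000 without it, which is comparable to the median close price of $195,000.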
To improve the model, we should also consider the listing status, the post date, and the number of days on the market as predictors, and use NLP to extract useful information from the PublicRemarks field.
To turn this into a data product that predicts home values in real time in a production environment, we would need to run the model for each house using all of its features and its list price. The current date would also be an input, so that the predicted price can be refreshed every few days.
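As a minimal sketch of the load-and-score part of such a product, the code below fits a model once, saves it to disk, and reloads it inside a scoring function. Everything here is an assumption for illustration: the toy data is made up, `lm` stands in for the fitted caret model, the file location is temporary, and `score_listing` is a hypothetical name. Handling of the current date is not covered.

```r
# Toy training data (made-up values); lm stands in for the fitted caret model
toy <- data.frame(
  LivingArea = c(1200, 1500, 1800, 2200, 2600, 3000),
  ListPrice  = c(150000, 185000, 215000, 260000, 310000, 355000),
  ClosePrice = c(145000, 180000, 210000, 250000, 300000, 345000)
)
fit <- lm(ClosePrice ~ LivingArea + ListPrice, data = toy)

# Persist the fitted model once, so a scoring service can load it at start-up
model_path <- tempfile(fileext = ".rds")   # hypothetical storage location
saveRDS(fit, model_path)

# Hypothetical scoring function: load the saved model and price one new listing
score_listing <- function(listing, path) {
  model <- readRDS(path)
  unname(predict(model, newdata = listing))
}

new_listing <- data.frame(LivingArea = 2000, ListPrice = 240000)
estimate <- score_listing(new_listing, model_path)
estimate
```

In a real service the model would be loaded once rather than on every call, and refitted periodically as new closed listings arrive.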