Data source
The price data were for properties in a selected area in Seattle, WA, that were sold on the market within the past 2000 days (from 2010-08-06 to 2016-01-27). House types of interest are single family house, townhouse and condo, with a size in square feet between 700 to 2000. The area of interest is a rectangular zone near Seattle, from 47.6 to 47.78 degrees for latitude and from -122.37 to -122.27 degrees for longitude. The data was downloaded repeated from links like this. The website applies an upper limit for the number of records shown in a single page, therefore I implemented an automated process using RSelenium to download all the data, and the code can be viewed here.
Prediction model
Outcome: sale price
Explanatory variables: home type, zip code, # bedrooms, # bathrooms, square feet, lot size, year built, parking spots, parking type
Final models can have subsets of the given explanatory variables with various transformations.
The metric used to evaluate model performance are mean squared error (MSE) and \(R^2\). I used a 90%/10% training/testing data split.
library(dplyr)
library(ggplot2)
library(car)
#== Load data ==#
#data downloaed from Redfin
prop_dat <- read.csv("./data_redfin/seattle_2000day_2b_700min_2000max_160125.csv")
names(prop_dat) <- tolower(names(prop_dat))
table(prop_dat$home.type)
##
## Condo/Coop Multi-Family (2-4 Unit)
## 3450 10
## Multi-Family (5+ Unit) Single Family Residential
## 3 5393
## Townhouse Vacant Land
## 2295 2
#hist(prop_dat$original.list.price-prop_dat$list.price)
#== Cleaning ==#
#select variables of interest
#make a factorized zipcode, lot yes/no, continuous sale date as integer (2009-1-1 as origin)
property_sub <- prop_dat %>%
filter(home.type %in% c('Single Family Residential','Townhouse','Condo/Coop')) %>%
mutate(zip_f = as.factor(zip),sale.date.int = as.integer(as.POSIXct(last.sale.date)-as.POSIXct('2009-1-1'))) %>%
select(home.type,zip_f,beds,baths,location,sqft,lot.size,year.built,parking.spots,
parking.type,latitude,longitude,sale.date.int,list.price,original.list.price,last.sale.price)
property_sub <- droplevels(property_sub)
levels(property_sub$home.type)[2] <- "Single Family"
table(property_sub$zip_f)
##
## 98010 98013 98019 98021 98027 98028 98036 98043 98101 98102 98103 98104
## 1 1 1 2 1 9 31 42 317 599 1798 113
## 98105 98107 98109 98111 98112 98113 98114 98115 98116 98117 98119 98121
## 527 315 545 1 565 1 2 1332 2 623 544 310
## 98122 98125 98127 98128 98133 98144 98146 98155 98164 98177 98199 98215
## 783 1126 1 1 941 195 1 198 3 204 1 1
## 99107
## 1
table(prop_dat$location)
##
## 98021 98027 98028
## 1 1 4
## 98036 98043 98101
## 8 9 147
## 98102 98103 98104
## 252 787 54
## 98105 98107 98109
## 262 136 218
## 98112 98115 98117
## 257 598 234
## 98119 98121 98122
## 222 140 308
## 98125 98133 98144
## 452 351 69
## 98155 98164 98177
## 74 2 92
## Arboretum Aurora Village Ballard
## 6 3 278
## Ballinger Belltown Belvedere Terrace
## 1 139 7
## Bitter Lake Briarcrest Brier
## 81 7 16
## Broadview Bryant Burien
## 144 99 1
## Capitol Hill Cedar Park Central Area
## 434 39 232
## Crestview Hills Crown Hill Denny Blaine
## 7 89 3
## Denny Triangle Downtown East Union
## 7 135 1
## Eastlake Echo Lake Firlands
## 118 49 1
## First Hill Fremont Green Lake
## 106 215 486
## Greenwood Haller Lake Hawthorne Hills
## 370 144 32
## Hillwood Innis Arden Interbay
## 4 1 2
## International District Jackson Park Judkins
## 3 66 24
## Kenmore Lake City Lake Forest Park
## 1 82 13
## Lake Union Laurelhurst Leschi
## 11 23 52
## Madison Park Madison Valley Madrona
## 72 168 37
## Maple Leaf Matthews Beach Meadowbrook
## 209 27 87
## Meridian Park Montlake Mountlake Terrace
## 3 19 32
## North Capitol Hill North City Northgate
## 32 4 141
## Olympic Hills Paramount Park Phinney Ridge
## 116 1 114
## Pinehurst Pioneer Square Portage Bay
## 129 15 7
## Queen Anne Ravenna Richmond Beach
## 598 127 5
## Richmond Highlands Ridgecrest Roosevelt
## 28 27 31
## Sand Point Se Mountlake Terrace Seattle
## 77 1 134
## Shoreline South Lake Union University District
## 108 24 68
## Uplake Victory Heights View Ridge
## 1 36 36
## Wallingford Washington Park Wedgwood
## 225 2 131
## Westlake Whittier Windermere
## 17 42 10
## Woodland Park
## 1
It seems zip code is a more accurate surrogate for location, given that the variable ‘location’ also includes levels like zip code.
#remove records with <10 obs of zip for more stable inference
in_zip <- property_sub %>% count(zip_f) %>% filter(n>=10) %>% select(zip_f) %>% unlist
property_sub <- filter(property_sub,zip_f %in% in_zip)
#price
hist(property_sub$last.sale.price)
tally(group_by(property_sub,home.type,last.sale.price)) %>%
group_by(home.type) %>% top_n(6,last.sale.price)
## Source: local data frame [18 x 3]
## Groups: home.type [3]
##
## home.type last.sale.price n
## (fctr) (dbl) (int)
## 1 Condo/Coop 2250000 1
## 2 Condo/Coop 2392000 1
## 3 Condo/Coop 2400000 1
## 4 Condo/Coop 2425000 1
## 5 Condo/Coop 2600000 1
## 6 Condo/Coop 2700000 1
## 7 Single Family 1295000 2
## 8 Single Family 1380000 1
## 9 Single Family 1400000 2
## 10 Single Family 1525000 1
## 11 Single Family 1550000 1
## 12 Single Family 2800000 1
## 13 Townhouse 939950 2
## 14 Townhouse 960000 1
## 15 Townhouse 980000 1
## 16 Townhouse 1000000 1
## 17 Townhouse 1175000 1
## 18 Townhouse 1200000 1
There seems to be a single family house with price much higher than others. We’d like to remove this potential influential data point.
property_sub <- filter(property_sub,last.sale.price<2000000)
#beds and baths
table(property_sub$beds)
##
## 0 1 2 3 4 5 6 8 9 12
## 4 41 5473 4697 768 74 6 1 2 2
table(property_sub$baths)
##
## 0.5 0.75 1 1.25 1.5 1.75 2 2.25 2.5 2.75 3 3.25 3.5 4 4.25
## 1 24 2730 19 1158 2249 2148 895 1146 93 243 216 123 1 1
## 4.5 8 10
## 4 1 1
#only keep data with explanatory variables well sampled
property_sub <- filter(property_sub,beds<=5,beds>=1,baths<4,baths>=0.75)
#square feet
hist(property_sub$sqft)
range(property_sub$sqft,na.rm = T)
## [1] 430 6680
property_sub <- filter(property_sub,property_sub$sqft>=700,property_sub$sqft<=2000)
#lot.size
range(property_sub$lot.size,na.rm = T)
## [1] 1 1253661
boxplot(property_sub$lot.size)
op = par(mfrow = c(2,2))
hist(property_sub$lot.size)
hist(property_sub$lot.size[property_sub$home.type=="Single Family"])
hist(property_sub$lot.size[property_sub$home.type=='Townhouse'])
hist(property_sub$lot.size[property_sub$home.type=='Condo/Coop'])
par(op)
#the 6 smallest and largest lot grouped by home type
tally(group_by(property_sub,home.type,lot.size)) %>%
group_by(home.type) %>% top_n(6,desc(lot.size))
## Source: local data frame [18 x 3]
## Groups: home.type [3]
##
## home.type lot.size n
## (fctr) (int) (int)
## 1 Condo/Coop 5 1
## 2 Condo/Coop 864 1
## 3 Condo/Coop 2589 1
## 4 Condo/Coop 3200 1
## 5 Condo/Coop 3300 1
## 6 Condo/Coop 3348 2
## 7 Single Family 809 1
## 8 Single Family 979 1
## 9 Single Family 983 1
## 10 Single Family 984 1
## 11 Single Family 991 4
## 12 Single Family 1000 3
## 13 Townhouse 1 14
## 14 Townhouse 500 4
## 15 Townhouse 600 3
## 16 Townhouse 640 3
## 17 Townhouse 651 1
## 18 Townhouse 669 1
tally(group_by(property_sub,home.type,lot.size)) %>%
group_by(home.type) %>% top_n(6,lot.size)
## Source: local data frame [18 x 3]
## Groups: home.type [3]
##
## home.type lot.size n
## (fctr) (int) (int)
## 1 Condo/Coop 261360 1
## 2 Condo/Coop 292300 4
## 3 Condo/Coop 361500 2
## 4 Condo/Coop 392040 1
## 5 Condo/Coop 400921 11
## 6 Condo/Coop 1253661 1
## 7 Single Family 25254 1
## 8 Single Family 27075 1
## 9 Single Family 29185 1
## 10 Single Family 31878 1
## 11 Single Family 42084 1
## 12 Single Family 47916 1
## 13 Townhouse 36500 1
## 14 Townhouse 36865 1
## 15 Townhouse 43560 1
## 16 Townhouse 47075 5
## 17 Townhouse 47826 2
## 18 Townhouse 61448 1
property_sub$lot.size[property_sub$lot.size %in% c(1,5)] <- NA
property_sub$lot.size[property_sub$lot.size == 1253661] <- 400921 #winsorize
#impute lot size for missing data
property_sub = property_sub %>% group_by(home.type) %>%
mutate(lot_median = median(lot.size,na.rm=T),lot.size.imp = ifelse(is.na(lot.size),lot_median,lot.size))
hist(property_sub$lot.size.imp)
#try log transformation
hist(log(property_sub$lot.size.imp))
#year built
hist(property_sub$year.built)
range(property_sub$year.built,na.rm = T)
## [1] 1051 2015
sort(property_sub$year.built)[1:10]
## [1] 1051 1651 1891 1893 1896 1900 1900 1900 1900 1900
property_sub <- filter(property_sub,year.built>=1900)
#parking.spots
table(property_sub$parking.spots,exclude = NULL)
##
## 0 1 2 3 4 5 6 154 200 240 242 260 302 <NA>
## 49 5194 1100 39 10 1 2 1 1 1 1 1 1 4541
property_sub$parking.spots[is.na(property_sub$parking.spots)] <- 0
property_sub <- filter(property_sub,parking.spots<=4) #remove records with >4 parking spots
#parking type
table(property_sub$parking.type,exclude = NULL)
##
## Garage <NA>
## 6606 4301 26
property_sub$parking.type[is.na(property_sub$parking.type)] <- ""
levels(property_sub$parking.type)[1] <- 'other' #non-garage parking
#latitude & longtitude
sum(is.na(property_sub$latitude))
## [1] 1
sum(is.na(property_sub$longitude))
## [1] 1
property_sub <- filter(property_sub,!is.na(latitude),!is.na(longitude))
range(property_sub$latitude)
## [1] 47.59631 47.78353
range(property_sub$longitude)
## [1] -122.3762 -122.2639
#seems both ok
#create tranformed variables
property_sub <- property_sub %>%
mutate(log_lotsize = log(lot.size.imp))
#drop unused levels
property_sub <- droplevels(property_sub)
property_sub <- select(property_sub,
last.sale.price,home.type,zip_f,beds,baths,year.built,sale.date.int,
parking.spots,parking.type,sale.date.int,sqft,log_lotsize,latitude,longitude)
property_sub = property_sub[complete.cases(property_sub),]
hist(property_sub$last.sale.price)
Try log transformation since the the distribution is skewed to the right.
hist(log(property_sub$last.sale.price))
#plot only a sub-sample to save time
set.seed(123)
pairs(sample_n(ungroup(property_sub),1000) %>% select(last.sale.price,beds,baths,year.built,parking.spots,sqft,log_lotsize,sale.date.int))
We can observe non-linear relationships between price and year.built, sqft (square feet), sale.date.int. The relationship with other variables are not obvious.
plot(property_sub$year.built,log(property_sub$last.sale.price))
lines(lowess(property_sub$year.built,log(property_sub$last.sale.price)), col = 2)
qplot(home.type,last.sale.price,data=property_sub,geom=c("boxplot"))
qplot(parking.type,last.sale.price,data=property_sub,geom=c("boxplot"))
#split data into training and testing sets
set.seed(123)
n = dim(property_sub)[1]
inTrain = sample(1:n,size = round(0.9*n))
training = property_sub[inTrain,]
#check home type distribution
home_type_frac = table(property_sub$home.type)
home_type_frac/sum(home_type_frac)
##
## Condo/Coop Single Family Townhouse
## 0.3100073 0.4819795 0.2080132
home_type_frac = table(training$home.type)
home_type_frac/sum(home_type_frac) #approximately the same proportions
##
## Condo/Coop Single Family Townhouse
## 0.3093810 0.4832808 0.2073381
#Normalize continuous variables. Note testing data use the same center and scale as training data.
train_scale = scale(training[,c("beds","baths","year.built","parking.spots","sqft","log_lotsize","sale.date.int")])
training[,c("beds","baths","year.built","parking.spots","sqft","log_lotsize","sale.date.int")] <- train_scale
y_train = scale(training$last.sale.price)
We’d expect that there are interactions between the type of property and other variables including primarily sqft, year.built, sale.date.int. Create extra variables so that the crPlots can be drawn for diagnostics.
training <- mutate(training,single_sqft = sqft*(home.type=="Single Family"),
th_sqft = sqft*(home.type=="Townhouse"),
single_yearb = year.built*(home.type=="Single Family"),
th_yearb = year.built*(home.type=="Townhouse"),
single_saledate = sale.date.int*(home.type=="Single Family"),
th_saledate = sale.date.int*(home.type=="Townhouse"))
mod_lm1= lm(y_train~factor(home.type)+zip_f+beds+baths+sqft+log_lotsize+year.built+sale.date.int+parking.spots+factor(parking.type)+single_sqft+th_sqft+single_yearb+th_yearb+single_saledate+th_saledate,data = training)
summary(mod_lm1)
##
## Call:
## lm(formula = y_train ~ factor(home.type) + zip_f + beds + baths +
## sqft + log_lotsize + year.built + sale.date.int + parking.spots +
## factor(parking.type) + single_sqft + th_sqft + single_yearb +
## th_yearb + single_saledate + th_saledate, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1432 -0.3234 -0.0265 0.2620 4.8222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.910263 0.125283 -7.266 4.00e-13 ***
## factor(home.type)Single Family 0.244148 0.032781 7.448 1.03e-13 ***
## factor(home.type)Townhouse -0.380892 0.060702 -6.275 3.65e-10 ***
## zip_f98043 -0.290287 0.153092 -1.896 0.057968 .
## zip_f98101 2.597296 0.127322 20.399 < 2e-16 ***
## zip_f98102 1.443823 0.123859 11.657 < 2e-16 ***
## zip_f98103 0.831240 0.121484 6.842 8.26e-12 ***
## zip_f98104 1.494876 0.136562 10.946 < 2e-16 ***
## zip_f98105 0.892983 0.123628 7.223 5.46e-13 ***
## zip_f98107 0.814744 0.125881 6.472 1.01e-10 ***
## zip_f98109 1.184760 0.123920 9.561 < 2e-16 ***
## zip_f98112 1.352922 0.123305 10.972 < 2e-16 ***
## zip_f98115 0.646404 0.121222 5.332 9.91e-08 ***
## zip_f98117 0.612297 0.122967 4.979 6.49e-07 ***
## zip_f98119 1.204477 0.123767 9.732 < 2e-16 ***
## zip_f98121 2.286939 0.127093 17.994 < 2e-16 ***
## zip_f98122 0.933487 0.122718 7.607 3.07e-14 ***
## zip_f98125 0.173769 0.120986 1.436 0.150956
## zip_f98133 0.146861 0.121335 1.210 0.226164
## zip_f98144 0.686992 0.128858 5.331 9.96e-08 ***
## zip_f98155 -0.038474 0.127149 -0.303 0.762207
## zip_f98177 0.422386 0.127296 3.318 0.000909 ***
## beds -0.027558 0.008172 -3.372 0.000748 ***
## baths 0.034910 0.008903 3.921 8.87e-05 ***
## sqft 0.784003 0.013799 56.815 < 2e-16 ***
## log_lotsize -0.022184 0.015152 -1.464 0.143214
## year.built 0.250440 0.018898 13.252 < 2e-16 ***
## sale.date.int 0.257464 0.012053 21.361 < 2e-16 ***
## parking.spots 0.012624 0.009226 1.368 0.171276
## factor(parking.type)Garage 0.006784 0.017304 0.392 0.695006
## single_sqft -0.534494 0.015992 -33.423 < 2e-16 ***
## th_sqft -0.382604 0.022735 -16.829 < 2e-16 ***
## single_yearb -0.259158 0.022112 -11.720 < 2e-16 ***
## th_yearb 0.068235 0.047870 1.425 0.154068
## single_saledate -0.022775 0.014214 -1.602 0.109119
## th_saledate 0.048947 0.017654 2.773 0.005572 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6026 on 9803 degrees of freedom
## Multiple R-squared: 0.6382, Adjusted R-squared: 0.6369
## F-statistic: 494 on 35 and 9803 DF, p-value: < 2.2e-16
op = par(mfrow = c(2,2))
plot(mod_lm1)
par(op)
From the 1st and the 3rd plots it appears the variance is not homogeneous. The 2nd plots shows that residual is not normal. The 4th plot tells us that there are no quite influential data points so we are okay.
qqPlot(mod_lm1, main="QQ Plot") #qq plot for studentized resid
Again the residual is not normally distributed.
spreadLevelPlot(mod_lm1)
##
## Suggested power transformation: 0.8234643
The residual variance increases as the mean values increase, which is reflected by a red line with a positive slope/trend. We can try to use transformed outcome variable (either square root or log transformation).
crPlots(mod_lm1)
We see there are nonlinear trends between the outcome variable with sqft, year.built, sale.date.int and log_lotsize. Given the shapes of nonlinearity, we can transform sqft and use polynomials for the other two variables. We don’t have much to do with log_lotsize, because a great proportion of observations for this variable were imputed.
Then we test the performance of this naive model.
#Normalize continuous variables for testing data. Note testing data use the same center and scale as training data.
testing = property_sub[-inTrain,]
testing[,c("beds","baths","year.built","parking.spots","sqft","log_lotsize","sale.date.int")] =
scale(testing[,c("beds","baths","year.built","parking.spots","sqft","log_lotsize","sale.date.int")],
center = attr(train_scale,'scaled:center'),
scale = attr(train_scale,'scaled:scale'))
y_test = scale(testing$last.sale.price,
center = attr(y_train,'scaled:center'),
scale = attr(y_train,'scaled:scale'))
testing <- mutate(testing,single_sqft = sqft*(home.type=="Single Family"),
th_sqft = sqft*(home.type=="Townhouse"),
single_yearb = year.built*(home.type=="Single Family"),
th_yearb = year.built*(home.type=="Townhouse"),
single_saledate = sale.date.int*(home.type=="Single Family"),
th_saledate = sale.date.int*(home.type=="Townhouse"))
y_pred_lm = predict(mod_lm1,newdata = testing)
mse_lm = sum((y_pred_lm-y_test)^2)/length(y_test)
plot(y_test,y_pred_lm,xlab="True normalized price",ylab="Predicted normalized price",
main="Prediction using naive linear regression")
text(4,-1,substitute(r^2 == r2,list(r2=cor(y_test,y_pred_lm))),adj=0)
text(4,-1.5,substitute(MSE == r2,list(r2=mse_lm)),adj=0)
abline(0,1)
Make transformed variables.
property_sub = property_sub %>%
mutate(sqrt_sqft = sqrt(sqft),sqrt_price=sqrt(last.sale.price),log_price=log(last.sale.price))
training = property_sub[inTrain,]
train_scale = scale(training[,c("beds","baths","year.built","parking.spots","sqrt_sqft","log_lotsize","sale.date.int")])
training[,c("beds","baths","year.built","parking.spots","sqrt_sqft","log_lotsize","sale.date.int")] <- train_scale
training <- mutate(training,single_sqft = sqrt_sqft*(home.type=="Single Family"),
th_sqft = sqrt_sqft*(home.type=="Townhouse"),
single_yearb = year.built*(home.type=="Single Family"),
th_yearb = year.built*(home.type=="Townhouse"),
single_yearb2 = year.built^2*(home.type=="Single Family"),
th_yearb2 = year.built^2*(home.type=="Townhouse"),
single_saledate = sale.date.int*(home.type=="Single Family"),
th_saledate = sale.date.int*(home.type=="Townhouse"),
single_saledate2 = sale.date.int^2*(home.type=="Single Family"),
th_saledate2 = sale.date.int^2*(home.type=="Townhouse"))
Test the updated explanatory variables. Try square root transformed outcome variable.
y_train = scale(training$sqrt_price)
mod_lm2= lm(y_train~factor(home.type)+zip_f+beds+baths+sqrt_sqft+log_lotsize+year.built+I(year.built^2)+sale.date.int+I(sale.date.int^2)+parking.spots+factor(parking.type)+single_sqft+th_sqft+single_yearb+th_yearb+single_saledate+th_saledate+single_yearb2+th_yearb2+single_saledate2+th_saledate2,data = training)
summary(mod_lm2)
##
## Call:
## lm(formula = y_train ~ factor(home.type) + zip_f + beds + baths +
## sqrt_sqft + log_lotsize + year.built + I(year.built^2) +
## sale.date.int + I(sale.date.int^2) + parking.spots + factor(parking.type) +
## single_sqft + th_sqft + single_yearb + th_yearb + single_saledate +
## th_saledate + single_yearb2 + th_yearb2 + single_saledate2 +
## th_saledate2, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1762 -0.3195 -0.0122 0.2881 3.6439
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.447592 0.117558 -12.314 < 2e-16 ***
## factor(home.type)Single Family 0.641675 0.035253 18.202 < 2e-16 ***
## factor(home.type)Townhouse -0.233678 0.062696 -3.727 0.000195 ***
## zip_f98043 -0.345120 0.142201 -2.427 0.015243 *
## zip_f98101 2.469077 0.118217 20.886 < 2e-16 ***
## zip_f98102 1.586292 0.115008 13.793 < 2e-16 ***
## zip_f98103 0.958805 0.112814 8.499 < 2e-16 ***
## zip_f98104 1.439073 0.127329 11.302 < 2e-16 ***
## zip_f98105 1.043115 0.114813 9.085 < 2e-16 ***
## zip_f98107 0.959450 0.116894 8.208 2.54e-16 ***
## zip_f98109 1.361063 0.115071 11.828 < 2e-16 ***
## zip_f98112 1.507362 0.114510 13.164 < 2e-16 ***
## zip_f98115 0.764960 0.112629 6.792 1.17e-11 ***
## zip_f98117 0.730921 0.114193 6.401 1.62e-10 ***
## zip_f98119 1.347182 0.114925 11.722 < 2e-16 ***
## zip_f98121 2.299109 0.118073 19.472 < 2e-16 ***
## zip_f98122 1.027653 0.113994 9.015 < 2e-16 ***
## zip_f98125 0.206705 0.112447 1.838 0.066056 .
## zip_f98133 0.139894 0.112755 1.241 0.214748
## zip_f98144 0.732895 0.119707 6.122 9.57e-10 ***
## zip_f98155 -0.079890 0.118262 -0.676 0.499356
## zip_f98177 0.490565 0.118432 4.142 3.47e-05 ***
## beds -0.023601 0.007646 -3.087 0.002028 **
## baths 0.061434 0.008478 7.246 4.62e-13 ***
## sqrt_sqft 0.668780 0.012700 52.658 < 2e-16 ***
## log_lotsize 0.020895 0.014541 1.437 0.150753
## year.built 0.235724 0.017583 13.407 < 2e-16 ***
## I(year.built^2) 0.298103 0.016470 18.100 < 2e-16 ***
## sale.date.int 0.298438 0.011912 25.053 < 2e-16 ***
## I(sale.date.int^2) 0.073248 0.010409 7.037 2.10e-12 ***
## parking.spots 0.028007 0.008639 3.242 0.001191 **
## factor(parking.type)Garage 0.049652 0.016215 3.062 0.002204 **
## single_sqft -0.410965 0.014887 -27.605 < 2e-16 ***
## th_sqft -0.294038 0.021601 -13.613 < 2e-16 ***
## single_yearb -0.241869 0.022732 -10.640 < 2e-16 ***
## th_yearb -0.182096 0.051445 -3.540 0.000403 ***
## single_saledate -0.022013 0.014302 -1.539 0.123798
## th_saledate 0.029953 0.017704 1.692 0.090710 .
## single_yearb2 -0.283840 0.019782 -14.348 < 2e-16 ***
## th_yearb2 0.156439 0.044498 3.516 0.000441 ***
## single_saledate2 -0.002811 0.013295 -0.211 0.832549
## th_saledate2 0.003742 0.016470 0.227 0.820272
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5595 on 9797 degrees of freedom
## Multiple R-squared: 0.6882, Adjusted R-squared: 0.6869
## F-statistic: 527.5 on 41 and 9797 DF, p-value: < 2.2e-16
op = par(mfrow = c(2,2))
plot(mod_lm2)
par(op)
The residual variance is better compared to the previous model, but is still heterogeneous.
spreadLevelPlot(mod_lm2)
##
## Suggested power transformation: 0.8984082
crPlots(mod_lm2)
There is still nonlinearity.
Try log transformed outcome variable.
y_train = scale(training$log_price)
mod_lm3= lm(y_train~factor(home.type)+zip_f+beds+baths+sqrt_sqft+log_lotsize+year.built+I(year.built^2)+sale.date.int+I(sale.date.int^2)+parking.spots+factor(parking.type)+single_sqft+th_sqft+single_yearb+th_yearb+single_saledate+th_saledate+single_yearb2+th_yearb2+single_saledate2+th_saledate2,data = training)
summary(mod_lm3)
##
## Call:
## lm(formula = y_train ~ factor(home.type) + zip_f + beds + baths +
## sqrt_sqft + log_lotsize + year.built + I(year.built^2) +
## sale.date.int + I(sale.date.int^2) + parking.spots + factor(parking.type) +
## single_sqft + th_sqft + single_yearb + th_yearb + single_saledate +
## th_saledate + single_yearb2 + th_yearb2 + single_saledate2 +
## th_saledate2, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8378 -0.3078 0.0155 0.3225 3.0200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.582961 0.117273 -13.498 < 2e-16 ***
## factor(home.type)Single Family 0.819960 0.035167 23.316 < 2e-16 ***
## factor(home.type)Townhouse -0.105759 0.062544 -1.691 0.09088 .
## zip_f98043 -0.354588 0.141856 -2.500 0.01245 *
## zip_f98101 2.250010 0.117930 19.079 < 2e-16 ***
## zip_f98102 1.650969 0.114729 14.390 < 2e-16 ***
## zip_f98103 1.008675 0.112541 8.963 < 2e-16 ***
## zip_f98104 1.493065 0.127020 11.755 < 2e-16 ***
## zip_f98105 1.099512 0.114535 9.600 < 2e-16 ***
## zip_f98107 1.022400 0.116610 8.768 < 2e-16 ***
## zip_f98109 1.448172 0.114792 12.616 < 2e-16 ***
## zip_f98112 1.535062 0.114232 13.438 < 2e-16 ***
## zip_f98115 0.777795 0.112355 6.923 4.71e-12 ***
## zip_f98117 0.775766 0.113916 6.810 1.03e-11 ***
## zip_f98119 1.391110 0.114646 12.134 < 2e-16 ***
## zip_f98121 2.239273 0.117787 19.011 < 2e-16 ***
## zip_f98122 1.073223 0.113717 9.438 < 2e-16 ***
## zip_f98125 0.166449 0.112174 1.484 0.13788
## zip_f98133 0.056713 0.112481 0.504 0.61413
## zip_f98144 0.758124 0.119417 6.349 2.27e-10 ***
## zip_f98155 -0.181391 0.117975 -1.538 0.12419
## zip_f98177 0.491341 0.118144 4.159 3.23e-05 ***
## beds -0.012920 0.007627 -1.694 0.09030 .
## baths 0.079520 0.008458 9.402 < 2e-16 ***
## sqrt_sqft 0.626955 0.012670 49.485 < 2e-16 ***
## log_lotsize 0.038600 0.014506 2.661 0.00780 **
## year.built 0.215644 0.017540 12.294 < 2e-16 ***
## I(year.built^2) 0.318308 0.016430 19.374 < 2e-16 ***
## sale.date.int 0.314091 0.011883 26.431 < 2e-16 ***
## I(sale.date.int^2) 0.083347 0.010384 8.026 1.12e-15 ***
## parking.spots 0.025562 0.008618 2.966 0.00302 **
## factor(parking.type)Garage 0.075649 0.016176 4.677 2.95e-06 ***
## single_sqft -0.379180 0.014851 -25.532 < 2e-16 ***
## th_sqft -0.278899 0.021548 -12.943 < 2e-16 ***
## single_yearb -0.240134 0.022677 -10.589 < 2e-16 ***
## th_yearb -0.138682 0.051320 -2.702 0.00690 **
## single_saledate -0.029509 0.014267 -2.068 0.03864 *
## th_saledate 0.008770 0.017661 0.497 0.61949
## single_yearb2 -0.328192 0.019734 -16.630 < 2e-16 ***
## th_yearb2 0.137642 0.044390 3.101 0.00194 **
## single_saledate2 -0.018470 0.013263 -1.393 0.16378
## th_saledate2 -0.018655 0.016430 -1.135 0.25621
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5582 on 9797 degrees of freedom
## Multiple R-squared: 0.6898, Adjusted R-squared: 0.6885
## F-statistic: 531.3 on 41 and 9797 DF, p-value: < 2.2e-16
op = par(mfrow = c(2,2))
plot(mod_lm3)
par(op)
Now the residual variance appears to be relatively constant.
spreadLevelPlot(mod_lm3)
##
## Suggested power transformation: 0.9742098
crPlots(mod_lm3)
hist(residuals(mod_lm3))
All variable appear to have linear relationship. Now everything seems to be okay, probably except for the assumption of normality of the residuals. However, this deviation from normality caused by the heavy tails is relatively acceptable.
We may think of dropping the interactions between property type and the polynomials of sale date since they are not that significant.
Drop the interactions between property type and the polynomials of sale date.
mod_lm4 = update(mod_lm3,.~.-single_saledate-th_saledate-single_yearb2-th_yearb2-single_saledate2-th_saledate2)
summary(mod_lm4)
##
## Call:
## lm(formula = y_train ~ factor(home.type) + zip_f + beds + baths +
## sqrt_sqft + log_lotsize + year.built + I(year.built^2) +
## sale.date.int + I(sale.date.int^2) + parking.spots + factor(parking.type) +
## single_sqft + th_sqft + single_yearb + th_yearb, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7778 -0.3118 0.0211 0.3294 3.1602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4576612 0.1187851 -12.271 < 2e-16 ***
## factor(home.type)Single Family 0.6069790 0.0310559 19.545 < 2e-16 ***
## factor(home.type)Townhouse -0.0522211 0.0579697 -0.901 0.3677
## zip_f98043 -0.3275019 0.1443425 -2.269 0.0233 *
## zip_f98101 2.2825436 0.1200048 19.020 < 2e-16 ***
## zip_f98102 1.6698454 0.1167753 14.300 < 2e-16 ***
## zip_f98103 0.9947526 0.1145377 8.685 < 2e-16 ***
## zip_f98104 1.6523231 0.1289208 12.817 < 2e-16 ***
## zip_f98105 1.0956247 0.1165730 9.399 < 2e-16 ***
## zip_f98107 1.0334297 0.1186798 8.708 < 2e-16 ***
## zip_f98109 1.4385161 0.1168369 12.312 < 2e-16 ***
## zip_f98112 1.5171898 0.1162532 13.051 < 2e-16 ***
## zip_f98115 0.7907614 0.1143560 6.915 4.97e-12 ***
## zip_f98117 0.7776796 0.1159517 6.707 2.10e-11 ***
## zip_f98119 1.3816157 0.1166804 11.841 < 2e-16 ***
## zip_f98121 2.2732885 0.1198519 18.967 < 2e-16 ***
## zip_f98122 1.0562388 0.1157204 9.128 < 2e-16 ***
## zip_f98125 0.1876836 0.1141706 1.644 0.1002
## zip_f98133 0.0784270 0.1144848 0.685 0.4933
## zip_f98144 0.7336307 0.1215223 6.037 1.63e-09 ***
## zip_f98155 -0.1255348 0.1200463 -1.046 0.2957
## zip_f98177 0.5585016 0.1201921 4.647 3.42e-06 ***
## beds 0.0007117 0.0077209 0.092 0.9266
## baths 0.0553488 0.0085008 6.511 7.83e-11 ***
## sqrt_sqft 0.6644254 0.0126360 52.582 < 2e-16 ***
## log_lotsize 0.0592193 0.0146260 4.049 5.19e-05 ***
## year.built 0.2329190 0.0177956 13.089 < 2e-16 ***
## I(year.built^2) 0.1189599 0.0099222 11.989 < 2e-16 ***
## sale.date.int 0.3026725 0.0073507 41.176 < 2e-16 ***
## I(sale.date.int^2) 0.0710626 0.0059503 11.943 < 2e-16 ***
## parking.spots 0.0165312 0.0087415 1.891 0.0586 .
## factor(parking.type)Garage 0.0831364 0.0163377 5.089 3.67e-07 ***
## single_sqft -0.4236719 0.0148532 -28.524 < 2e-16 ***
## th_sqft -0.2806124 0.0214869 -13.060 < 2e-16 ***
## single_yearb -0.1393480 0.0223775 -6.227 4.95e-10 ***
## th_yearb 0.0842578 0.0454266 1.855 0.0637 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5682 on 9803 degrees of freedom
## Multiple R-squared: 0.6783, Adjusted R-squared: 0.6772
## F-statistic: 590.6 on 35 and 9803 DF, p-value: < 2.2e-16
AIC(mod_lm3)
## [1] 16491.25
AIC(mod_lm4)
## [1] 16835.45
anova(mod_lm3,mod_lm4)
## Analysis of Variance Table
##
## Model 1: y_train ~ factor(home.type) + zip_f + beds + baths + sqrt_sqft +
## log_lotsize + year.built + I(year.built^2) + sale.date.int +
## I(sale.date.int^2) + parking.spots + factor(parking.type) +
## single_sqft + th_sqft + single_yearb + th_yearb + single_saledate +
## th_saledate + single_yearb2 + th_yearb2 + single_saledate2 +
## th_saledate2
## Model 2: y_train ~ factor(home.type) + zip_f + beds + baths + sqrt_sqft +
## log_lotsize + year.built + I(year.built^2) + sale.date.int +
## I(sale.date.int^2) + parking.spots + factor(parking.type) +
## single_sqft + th_sqft + single_yearb + th_yearb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 9797 3052.2
## 2 9803 3164.7 -6 -112.52 60.196 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since model 3 has remarkably smaller AIC and ANOVA shows significant difference, we will not drop those interaction terms.
vif(mod_lm3) #test for collinearity
## GVIF Df GVIF^(1/(2*Df))
## factor(home.type) 109.749257 2 3.236685
## zip_f 3.352210 19 1.032344
## beds 1.836947 1 1.355340
## baths 2.258940 1 1.502977
## sqrt_sqft 5.069024 1 2.251449
## log_lotsize 6.644626 1 2.577717
## year.built 9.715114 1 3.116908
## I(year.built^2) 6.305897 1 2.511154
## sale.date.int 4.459425 1 2.111735
## I(sale.date.int^2) 3.810838 1 1.952137
## parking.spots 2.345446 1 1.531485
## factor(parking.type) 1.973412 1 1.404782
## single_sqft 3.658474 1 1.912714
## th_sqft 1.757016 1 1.325525
## single_yearb 6.911675 1 2.629006
## th_yearb 16.162303 1 4.020237
## single_saledate 3.139390 1 1.771832
## th_saledate 1.985659 1 1.409134
## single_yearb2 10.480757 1 3.237400
## th_yearb2 15.523755 1 3.940020
## single_saledate2 4.430656 1 2.104912
## th_saledate2 3.244182 1 1.801161
There does not appear to be any problem of collinearity.
Then we test the performance of this improved model.
#Normalize continuous variables for testing data. Note testing data use the same center and scale as training data.
testing = property_sub[-inTrain,]
testing[,c("beds","baths","year.built","parking.spots","sqrt_sqft","log_lotsize","sale.date.int")] =
scale(testing[,c("beds","baths","year.built","parking.spots","sqrt_sqft","log_lotsize","sale.date.int")],
center = attr(train_scale,'scaled:center'),
scale = attr(train_scale,'scaled:scale'))
y_test = scale(testing$log_price,
center = attr(y_train,'scaled:center'),
scale = attr(y_train,'scaled:scale'))
testing <- mutate(testing,single_sqft = sqrt_sqft*(home.type=="Single Family"),
th_sqft = sqrt_sqft*(home.type=="Townhouse"),
single_yearb = year.built*(home.type=="Single Family"),
th_yearb = year.built*(home.type=="Townhouse"),
single_yearb2 = year.built^2*(home.type=="Single Family"),
th_yearb2 = year.built^2*(home.type=="Townhouse"),
single_saledate = sale.date.int*(home.type=="Single Family"),
th_saledate = sale.date.int*(home.type=="Townhouse"),
single_saledate2 = sale.date.int^2*(home.type=="Single Family"),
th_saledate2 = sale.date.int^2*(home.type=="Townhouse"))
y_pred_lm = predict(mod_lm3,newdata = testing)
mse_lm = sum((y_pred_lm-y_test)^2)/length(y_test)
plot(y_test,y_pred_lm,xlab="True normalized price",ylab="Predicted normalized price",
main="Prediction using improved linear model")
text(-5,2,substitute(r^2 == r2,list(r2=cor(y_test,y_pred_lm))),adj=0)
text(-5,1.7,substitute(MSE == r2,list(r2=mse_lm)),adj=0)
abline(0,1)
Now we can use this model to answer questions about property prices in areas around Seattle. Say we are interested in properties with specifics as following
We can predict the expected prices by property type and location(zip code).
#== prediction ==#
zip_grid = unique(property_sub$zip_f)
home_grid = c('Single Family','Townhouse','Condo/Coop')
date_grid = as.integer(as.POSIXct(c('2010-1-1','2011-1-1','2012-1-1','2013-1-1','2014-1-1','2015-1-1','2016-1-1'))-as.POSIXct('2009-1-1'))
bed_grid = 3
bath_grid = 2
yearb_grid = 2000
sqft_grid = sqrt(1200)
lot_grid = median(property_sub$log_lotsize)
parking_grid = median(property_sub$parking.spots)
parking_type = 'Garage'
par_mat = expand.grid(zip_grid,home_grid,date_grid,bed_grid,bath_grid,yearb_grid,sqft_grid,lot_grid,parking_grid,parking_type)
names(par_mat) = c("zip_f","home.type","sale.date.int","beds","baths","year.built","sqrt_sqft","log_lotsize","parking.spots","parking.type")
par_mat_std = par_mat
par_mat_std[,c("beds","baths","year.built","parking.spots","sqrt_sqft","log_lotsize","sale.date.int")] =
scale(par_mat_std[,c("beds","baths","year.built","parking.spots","sqrt_sqft","log_lotsize","sale.date.int")],
center = attr(train_scale,'scaled:center'),
scale = attr(train_scale,'scaled:scale'))
par_mat_std <- mutate(par_mat_std,single_sqft = sqrt_sqft*(home.type=="Single Family"),
th_sqft = sqrt_sqft*(home.type=="Townhouse"),
single_yearb = year.built*(home.type=="Single Family"),
th_yearb = year.built*(home.type=="Townhouse"),
single_yearb2 = year.built^2*(home.type=="Single Family"),
th_yearb2 = year.built^2*(home.type=="Townhouse"),
single_saledate = sale.date.int*(home.type=="Single Family"),
th_saledate = sale.date.int*(home.type=="Townhouse"),
single_saledate2 = sale.date.int^2*(home.type=="Single Family"),
th_saledate2 = sale.date.int^2*(home.type=="Townhouse"))
query_price = exp(predict(mod_lm3,par_mat_std)*attr(y_train,'scaled:scale')+attr(y_train,'scaled:center'))
final_res = cbind(par_mat,query_price)
#== long & lati with zip ==#
zip_longlat <- property_sub %>% group_by(zip_f) %>%
summarise(lat = mean(latitude), long = mean(longitude))
final_res = merge(final_res,zip_longlat,by='zip_f')
#== plot map ==#
library(ggmap)
price_range = range(query_price)
final_res$shift = 0.01*as.integer(final_res$home.type)-0.015
map <- get_map(c(-122.4,47.7), zoom = 11,maptype="roadmap")
Before we can ask any real question, let’s confirm on this belief: the price IS INCREASING.
mod_lm5= update(mod_lm3,.~.-sale.date.int-I(sale.date.int^2))
plot(training$sale.date.int,resid(mod_lm5))
abline(lm(resid(mod_lm5)~training$sale.date.int),col='green',lwd=2)
lines(lowess(resid(mod_lm5)~training$sale.date.int),col='red',lwd=2)
This is the trend after adjusting for other variables, and we are looking at the relationship between the price (residuals, technically speaking) and property sale date. Although the unit has been scaled, we can still observe a positive trend. In fact, if we compare the locally weighted smoothing line in red against a linear fitting in green, we see this price increase is even faster than a linear growth. However, the red line is nearly flat in the first 1/3 period of time within the investigated time interval (2010-08-06 to 2016-01-27). This tells us this speedup in the price growth rate happened only recently.
ggmap(map) +
geom_point(aes(x = long+shift, y = lat, fill = query_price,group=home.type,shape=home.type),data = subset(final_res,sale.date.int==as.POSIXct('2010-1-1')-as.POSIXct('2009-1-1')),
alpha = .8,colour='black',size=4)+
scale_shape_manual(values = c(21:23),labels=c("Single Family","Townhouse","Condo")) +
scale_fill_gradientn(colours = topo.colors(10),limits=price_range,
labels=c("200k","400k","600k","800k","1M")) +
ggtitle("Predicted property price on 2010-1-1") +
xlab("Longitude") +
ylab("Latitude") +
theme(plot.title = element_text(lineheight=1, face="bold",size=20),
axis.text=element_text(size=10),axis.title=element_text(size=15),
legend.position=c(0,1),legend.justification=c(0,1),
legend.title=element_text(size=15),
legend.text=element_text(size=15),
legend.key.height=unit(2,"line"),
legend.box = "horizontal",
legend.box.just = "top")
We see a price distribution as expected, with the properties in the downtown area the most expensive and those getting cheaper as we move further north.
ggmap(map) +
geom_point(aes(x = long+shift, y = lat, fill = query_price,group=home.type,shape=home.type),data = subset(final_res,sale.date.int==as.POSIXct('2016-1-1')-as.POSIXct('2009-1-1')),
alpha = .8,colour='black',size=4)+
scale_shape_manual(values = c(21:23),labels=c("Single Family","Townhouse","Condo")) +
scale_fill_gradientn(colours = topo.colors(10),limits=price_range,
labels=c("200k","400k","600k","800k","1M")) +
ggtitle("Predicted property price on 2016-1-1") +
xlab("Longitude") +
ylab("Latitude") +
theme(plot.title = element_text(lineheight=1, face="bold",size=20),
axis.text=element_text(size=10),axis.title=element_text(size=15),
legend.position=c(0,1),legend.justification=c(0,1),
legend.title=element_text(size=15),
legend.text=element_text(size=15),
legend.key.height=unit(2,"line"),
legend.box = "horizontal",
legend.box.just = "top")
price_year_chg = final_res %>%
select(zip_f,home.type,sale.date.int,shift,query_price,long,lat) %>%
filter(sale.date.int %in% as.integer(as.POSIXct(c('2016-1-1','2015-1-1'))-as.POSIXct('2009-1-1'))) %>%
arrange(zip_f,home.type,sale.date.int) %>%
mutate(price_lag = lag(query_price),diff_price = (query_price-price_lag), diff_price_prop = diff_price/price_lag) %>%
filter(sale.date.int == as.POSIXct('2016-1-1')-as.POSIXct('2009-1-1'))
price_chg_range = range(price_year_chg$diff_price)
ggmap(map) +
geom_point(aes(x = long+shift, y = lat, fill = diff_price,group=home.type,shape=home.type),
data = price_year_chg,alpha = .8,colour='black',size=4)+
scale_shape_manual(values = c(21:23),labels=c("Single Family","Townhouse","Condo")) +
scale_fill_gradientn(colours = topo.colors(10),limits=price_chg_range,name="Price_diff") +
ggtitle("Price change between 2015-1-1 and 2016-1-1") +
xlab("Longitude") +
ylab("Latitude") +
theme(plot.title = element_text(lineheight=1, face="bold",size=20),
axis.text=element_text(size=10),axis.title=element_text(size=15),
legend.position=c(0,1),legend.justification=c(0,1),
legend.title=element_text(size=15),
legend.text=element_text(size=15),
legend.key.height=unit(2,"line"),
legend.box = "horizontal",
legend.box.just = "top")
price_chg_range = range(price_year_chg$diff_price_prop*100)
ggmap(map) +
geom_point(aes(x = long+shift, y = lat, fill = diff_price_prop*100,group=home.type,shape=home.type),
data = price_year_chg,alpha = .8,colour='black',size=4)+
scale_shape_manual(values = c(21:23),labels=c("Single Family","Townhouse","Condo")) +
scale_fill_gradientn(colours = topo.colors(10),limits=price_chg_range,name="Price_diff %") +
ggtitle("Price change percentage between 2015-1-1 and 2016-1-1") +
xlab("Longitude") +
ylab("Latitude") +
theme(plot.title = element_text(lineheight=1, face="bold",size=20),
axis.text=element_text(size=10),axis.title=element_text(size=15),
legend.position=c(0,1),legend.justification=c(0,1),
legend.title=element_text(size=15),
legend.text=element_text(size=15),
legend.key.height=unit(2,"line"),
legend.box = "horizontal",
legend.box.just = "top")
This is less informative in terms of the lack of difference in all areas. This is due to not including the interaction between zip code and sale date.
Now we look at the annual price change percentage.
price_year_chg = final_res %>%
select(zip_f,home.type,sale.date.int,shift,query_price,long,lat) %>%
arrange(zip_f,home.type,sale.date.int) %>%
mutate(price_lag = lag(query_price),diff_price = (query_price-price_lag), diff_price_prop = diff_price/price_lag) %>%
filter(zip_f==98036,sale.date.int!=365)
library(scales) #for percent in the next plot
xlab_date <- function(x){
as.Date('2009-1-1')+x
}
ggplot(aes(x=sale.date.int,y=diff_price_prop,group=home.type,color=home.type),data=price_year_chg) +
geom_point() +
geom_line() +
ggtitle("Predicted annual price change") +
scale_x_continuous(name="Sale date",labels = xlab_date,breaks = unique(price_year_chg$sale.date.int)) +
scale_y_continuous(name="Price_diff %",labels = percent) +
theme(plot.title = element_text(lineheight=1, face="bold",size=20),
axis.text=element_text(size=10),axis.title=element_text(size=15),
legend.position=c(0,1),legend.justification=c(0,1),
legend.title=element_text(size=15),
legend.text=element_text(size=15),
legend.key.height=unit(2,"line"),
legend.box = "horizontal",
legend.box.just = "top")
We see that despite of the fact that condos are the cheapest properties with the same property parameters, their price growth rate was the lowest before 2013, and exceeded that of townhouses and eventually single family houses in 2014~2015. Note that the linearity in increasing growth rate comes from the 2st degree polynomial in our ‘mod_lm3’ model.
A model was built to analyze/estimate the property prices in areas around Seattle. This model, although not complicated, revealed the recent situations in the real estates market. We observed a trend of growth in property prices, and the distribution of this trend in different areas was also estimated.
This model provided an insight in the price growth for three property types: single family house, townhouse and condo. People can use this information to help determine which type meets their requirements and preferences.
The analysis was bases on a linear model. I can also implement more complicated approaches such as elastic net, support vector regression, etc. However, according to previous experiences, the improvement is unlikely to be remarkable.
Current analysis was only based on information available on Redfin.com. There are apparently other important factors that may dramatically impact the price of a given property, e.g. recent remodeling/refurbishment, good schools nearby, etc.
In short, if you want to own your own property in Seattle and have the money, probably you should do it NOW!