The goal of this project is to use house specifics to predict its market price, using data downloaded from Redfin.com.

  1. Data source
    The price data were for properties in a selected area in Seattle, WA, that were sold on the market within the past 2000 days (from 2010-08-06 to 2016-01-27). House types of interest are single family house, townhouse and condo, with a size in square feet between 700 to 2000. The area of interest is a rectangular zone near Seattle, from 47.6 to 47.78 degrees for latitude and from -122.37 to -122.27 degrees for longitude. The data was downloaded repeated from links like this. The website applies an upper limit for the number of records shown in a single page, therefore I implemented an automated process using RSelenium to download all the data, and the code can be viewed here.

  2. Prediction model
    Outcome: sale price
    Explanatory variables: home type, zip code, # bedrooms, # bathrooms, square feet, lot size, year built, parking spots, parking type
    Final models can have subsets of the given explanatory variables with various transformations.
    The metric used to evaluate model performance are mean squared error (MSE) and \(R^2\). I used a 90%/10% training/testing data split.

Load data

library(dplyr)
library(ggplot2)
library(car)

#== Load data ==#

#data downloaed from Redfin
prop_dat <- read.csv("./data_redfin/seattle_2000day_2b_700min_2000max_160125.csv")

names(prop_dat) <- tolower(names(prop_dat))

table(prop_dat$home.type)
## 
##                Condo/Coop   Multi-Family (2-4 Unit) 
##                      3450                        10 
##    Multi-Family (5+ Unit) Single Family Residential 
##                         3                      5393 
##                 Townhouse               Vacant Land 
##                      2295                         2
#hist(prop_dat$original.list.price-prop_dat$list.price)

Clean data

#== Cleaning ==#

#select variables of interest
#make a factorized zipcode, lot yes/no, continuous sale date as integer (2009-1-1 as origin)
property_sub <- prop_dat %>%
    filter(home.type %in% c('Single Family Residential','Townhouse','Condo/Coop')) %>%
    mutate(zip_f = as.factor(zip),sale.date.int = as.integer(as.POSIXct(last.sale.date)-as.POSIXct('2009-1-1'))) %>%
    select(home.type,zip_f,beds,baths,location,sqft,lot.size,year.built,parking.spots,
           parking.type,latitude,longitude,sale.date.int,list.price,original.list.price,last.sale.price)

property_sub <- droplevels(property_sub)
levels(property_sub$home.type)[2] <- "Single Family"
table(property_sub$zip_f)
## 
## 98010 98013 98019 98021 98027 98028 98036 98043 98101 98102 98103 98104 
##     1     1     1     2     1     9    31    42   317   599  1798   113 
## 98105 98107 98109 98111 98112 98113 98114 98115 98116 98117 98119 98121 
##   527   315   545     1   565     1     2  1332     2   623   544   310 
## 98122 98125 98127 98128 98133 98144 98146 98155 98164 98177 98199 98215 
##   783  1126     1     1   941   195     1   198     3   204     1     1 
## 99107 
##     1
table(prop_dat$location)
## 
##                  98021                  98027                  98028 
##                      1                      1                      4 
##                  98036                  98043                  98101 
##                      8                      9                    147 
##                  98102                  98103                  98104 
##                    252                    787                     54 
##                  98105                  98107                  98109 
##                    262                    136                    218 
##                  98112                  98115                  98117 
##                    257                    598                    234 
##                  98119                  98121                  98122 
##                    222                    140                    308 
##                  98125                  98133                  98144 
##                    452                    351                     69 
##                  98155                  98164                  98177 
##                     74                      2                     92 
##              Arboretum         Aurora Village                Ballard 
##                      6                      3                    278 
##              Ballinger               Belltown      Belvedere Terrace 
##                      1                    139                      7 
##            Bitter Lake             Briarcrest                  Brier 
##                     81                      7                     16 
##              Broadview                 Bryant                 Burien 
##                    144                     99                      1 
##           Capitol Hill             Cedar Park           Central Area 
##                    434                     39                    232 
##        Crestview Hills             Crown Hill           Denny Blaine 
##                      7                     89                      3 
##         Denny Triangle               Downtown             East Union 
##                      7                    135                      1 
##               Eastlake              Echo Lake               Firlands 
##                    118                     49                      1 
##             First Hill                Fremont             Green Lake 
##                    106                    215                    486 
##              Greenwood            Haller Lake        Hawthorne Hills 
##                    370                    144                     32 
##               Hillwood            Innis Arden               Interbay 
##                      4                      1                      2 
## International District           Jackson Park                Judkins 
##                      3                     66                     24 
##                Kenmore              Lake City       Lake Forest Park 
##                      1                     82                     13 
##             Lake Union            Laurelhurst                 Leschi 
##                     11                     23                     52 
##           Madison Park         Madison Valley                Madrona 
##                     72                    168                     37 
##             Maple Leaf         Matthews Beach            Meadowbrook 
##                    209                     27                     87 
##          Meridian Park               Montlake      Mountlake Terrace 
##                      3                     19                     32 
##     North Capitol Hill             North City              Northgate 
##                     32                      4                    141 
##          Olympic Hills         Paramount Park          Phinney Ridge 
##                    116                      1                    114 
##              Pinehurst         Pioneer Square            Portage Bay 
##                    129                     15                      7 
##             Queen Anne                Ravenna         Richmond Beach 
##                    598                    127                      5 
##     Richmond Highlands             Ridgecrest              Roosevelt 
##                     28                     27                     31 
##             Sand Point   Se Mountlake Terrace                Seattle 
##                     77                      1                    134 
##              Shoreline       South Lake Union    University District 
##                    108                     24                     68 
##                 Uplake        Victory Heights             View Ridge 
##                      1                     36                     36 
##            Wallingford        Washington Park               Wedgwood 
##                    225                      2                    131 
##               Westlake               Whittier             Windermere 
##                     17                     42                     10 
##          Woodland Park 
##                      1

It seems zip code is a more accurate surrogate for location, given that the variable ‘location’ also includes levels like zip code.

#remove records with <10 obs of zip for more stable inference
in_zip <- property_sub %>% count(zip_f) %>% filter(n>=10) %>% select(zip_f) %>% unlist
property_sub <- filter(property_sub,zip_f %in% in_zip)


#price
hist(property_sub$last.sale.price)

tally(group_by(property_sub,home.type,last.sale.price)) %>%
    group_by(home.type) %>% top_n(6,last.sale.price)
## Source: local data frame [18 x 3]
## Groups: home.type [3]
## 
##        home.type last.sale.price     n
##           (fctr)           (dbl) (int)
## 1     Condo/Coop         2250000     1
## 2     Condo/Coop         2392000     1
## 3     Condo/Coop         2400000     1
## 4     Condo/Coop         2425000     1
## 5     Condo/Coop         2600000     1
## 6     Condo/Coop         2700000     1
## 7  Single Family         1295000     2
## 8  Single Family         1380000     1
## 9  Single Family         1400000     2
## 10 Single Family         1525000     1
## 11 Single Family         1550000     1
## 12 Single Family         2800000     1
## 13     Townhouse          939950     2
## 14     Townhouse          960000     1
## 15     Townhouse          980000     1
## 16     Townhouse         1000000     1
## 17     Townhouse         1175000     1
## 18     Townhouse         1200000     1

There seems to be a single family house with price much higher than others. We’d like to remove this potential influential data point.

property_sub <- filter(property_sub,last.sale.price<2000000)

#beds and baths
table(property_sub$beds)
## 
##    0    1    2    3    4    5    6    8    9   12 
##    4   41 5473 4697  768   74    6    1    2    2
table(property_sub$baths)
## 
##  0.5 0.75    1 1.25  1.5 1.75    2 2.25  2.5 2.75    3 3.25  3.5    4 4.25 
##    1   24 2730   19 1158 2249 2148  895 1146   93  243  216  123    1    1 
##  4.5    8   10 
##    4    1    1
#only keep data with explanatory variables well sampled
property_sub <- filter(property_sub,beds<=5,beds>=1,baths<4,baths>=0.75)


#square feet
hist(property_sub$sqft)

range(property_sub$sqft,na.rm = T)
## [1]  430 6680
property_sub <- filter(property_sub,property_sub$sqft>=700,property_sub$sqft<=2000)


#lot.size
range(property_sub$lot.size,na.rm = T)
## [1]       1 1253661
boxplot(property_sub$lot.size)

op = par(mfrow = c(2,2))
hist(property_sub$lot.size)
hist(property_sub$lot.size[property_sub$home.type=="Single Family"])
hist(property_sub$lot.size[property_sub$home.type=='Townhouse'])
hist(property_sub$lot.size[property_sub$home.type=='Condo/Coop'])

par(op)

#the 6 smallest and largest lot grouped by home type
tally(group_by(property_sub,home.type,lot.size)) %>%
    group_by(home.type) %>% top_n(6,desc(lot.size))
## Source: local data frame [18 x 3]
## Groups: home.type [3]
## 
##        home.type lot.size     n
##           (fctr)    (int) (int)
## 1     Condo/Coop        5     1
## 2     Condo/Coop      864     1
## 3     Condo/Coop     2589     1
## 4     Condo/Coop     3200     1
## 5     Condo/Coop     3300     1
## 6     Condo/Coop     3348     2
## 7  Single Family      809     1
## 8  Single Family      979     1
## 9  Single Family      983     1
## 10 Single Family      984     1
## 11 Single Family      991     4
## 12 Single Family     1000     3
## 13     Townhouse        1    14
## 14     Townhouse      500     4
## 15     Townhouse      600     3
## 16     Townhouse      640     3
## 17     Townhouse      651     1
## 18     Townhouse      669     1
tally(group_by(property_sub,home.type,lot.size)) %>%
    group_by(home.type) %>% top_n(6,lot.size)
## Source: local data frame [18 x 3]
## Groups: home.type [3]
## 
##        home.type lot.size     n
##           (fctr)    (int) (int)
## 1     Condo/Coop   261360     1
## 2     Condo/Coop   292300     4
## 3     Condo/Coop   361500     2
## 4     Condo/Coop   392040     1
## 5     Condo/Coop   400921    11
## 6     Condo/Coop  1253661     1
## 7  Single Family    25254     1
## 8  Single Family    27075     1
## 9  Single Family    29185     1
## 10 Single Family    31878     1
## 11 Single Family    42084     1
## 12 Single Family    47916     1
## 13     Townhouse    36500     1
## 14     Townhouse    36865     1
## 15     Townhouse    43560     1
## 16     Townhouse    47075     5
## 17     Townhouse    47826     2
## 18     Townhouse    61448     1
property_sub$lot.size[property_sub$lot.size %in% c(1,5)] <- NA
property_sub$lot.size[property_sub$lot.size == 1253661] <- 400921    #winsorize

#impute lot size for missing data
property_sub = property_sub %>% group_by(home.type) %>% 
    mutate(lot_median = median(lot.size,na.rm=T),lot.size.imp = ifelse(is.na(lot.size),lot_median,lot.size))

hist(property_sub$lot.size.imp)

#try log transformation
hist(log(property_sub$lot.size.imp))

#year built
hist(property_sub$year.built)

range(property_sub$year.built,na.rm = T)
## [1] 1051 2015
sort(property_sub$year.built)[1:10]
##  [1] 1051 1651 1891 1893 1896 1900 1900 1900 1900 1900
property_sub <- filter(property_sub,year.built>=1900)


#parking.spots
table(property_sub$parking.spots,exclude = NULL)
## 
##    0    1    2    3    4    5    6  154  200  240  242  260  302 <NA> 
##   49 5194 1100   39   10    1    2    1    1    1    1    1    1 4541
property_sub$parking.spots[is.na(property_sub$parking.spots)] <- 0
property_sub <- filter(property_sub,parking.spots<=4)    #remove records with >4 parking spots

#parking type
table(property_sub$parking.type,exclude = NULL)
## 
##        Garage   <NA> 
##   6606   4301     26
property_sub$parking.type[is.na(property_sub$parking.type)] <- ""
levels(property_sub$parking.type)[1] <- 'other'    #non-garage parking


#latitude & longtitude
sum(is.na(property_sub$latitude))
## [1] 1
sum(is.na(property_sub$longitude))
## [1] 1
property_sub <- filter(property_sub,!is.na(latitude),!is.na(longitude))

range(property_sub$latitude)
## [1] 47.59631 47.78353
range(property_sub$longitude)
## [1] -122.3762 -122.2639
#seems both ok


#create tranformed variables
property_sub <- property_sub %>%
    mutate(log_lotsize = log(lot.size.imp))


#drop unused levels
property_sub <- droplevels(property_sub)

property_sub <- select(property_sub,
                       last.sale.price,home.type,zip_f,beds,baths,year.built,sale.date.int,
                       parking.spots,parking.type,sale.date.int,sqft,log_lotsize,latitude,longitude)
property_sub = property_sub[complete.cases(property_sub),]

Exploratory analysis

hist(property_sub$last.sale.price)

Try log transformation since the the distribution is skewed to the right.

hist(log(property_sub$last.sale.price))

#plot only a sub-sample to save time
set.seed(123)
pairs(sample_n(ungroup(property_sub),1000) %>% select(last.sale.price,beds,baths,year.built,parking.spots,sqft,log_lotsize,sale.date.int))

We can observe non-linear relationships between price and year.built, sqft (square feet), sale.date.int. The relationship with other variables are not obvious.

plot(property_sub$year.built,log(property_sub$last.sale.price))
lines(lowess(property_sub$year.built,log(property_sub$last.sale.price)), col = 2)

qplot(home.type,last.sale.price,data=property_sub,geom=c("boxplot"))

qplot(parking.type,last.sale.price,data=property_sub,geom=c("boxplot"))

Build model

#split data into training and testing sets
set.seed(123)
n = dim(property_sub)[1]
inTrain = sample(1:n,size = round(0.9*n))

training = property_sub[inTrain,]

#check home type distribution
home_type_frac = table(property_sub$home.type)
home_type_frac/sum(home_type_frac)
## 
##    Condo/Coop Single Family     Townhouse 
##     0.3100073     0.4819795     0.2080132
home_type_frac = table(training$home.type)
home_type_frac/sum(home_type_frac)    #approximately the same proportions
## 
##    Condo/Coop Single Family     Townhouse 
##     0.3093810     0.4832808     0.2073381
#Normalize continuous variables. Note testing data use the same center and scale as training data.

train_scale = scale(training[,c("beds","baths","year.built","parking.spots","sqft","log_lotsize","sale.date.int")])
training[,c("beds","baths","year.built","parking.spots","sqft","log_lotsize","sale.date.int")] <- train_scale
y_train = scale(training$last.sale.price)

We’d expect that there are interactions between the type of property and other variables including primarily sqft, year.built, sale.date.int. Create extra variables so that the crPlots can be drawn for diagnostics.

training <- mutate(training,single_sqft = sqft*(home.type=="Single Family"),
                       th_sqft = sqft*(home.type=="Townhouse"),
                       single_yearb = year.built*(home.type=="Single Family"),
                       th_yearb = year.built*(home.type=="Townhouse"),
                       single_saledate = sale.date.int*(home.type=="Single Family"),
                       th_saledate = sale.date.int*(home.type=="Townhouse"))

First fit a standard linear model.

mod_lm1= lm(y_train~factor(home.type)+zip_f+beds+baths+sqft+log_lotsize+year.built+sale.date.int+parking.spots+factor(parking.type)+single_sqft+th_sqft+single_yearb+th_yearb+single_saledate+th_saledate,data = training)
summary(mod_lm1)
## 
## Call:
## lm(formula = y_train ~ factor(home.type) + zip_f + beds + baths + 
##     sqft + log_lotsize + year.built + sale.date.int + parking.spots + 
##     factor(parking.type) + single_sqft + th_sqft + single_yearb + 
##     th_yearb + single_saledate + th_saledate, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1432 -0.3234 -0.0265  0.2620  4.8222 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -0.910263   0.125283  -7.266 4.00e-13 ***
## factor(home.type)Single Family  0.244148   0.032781   7.448 1.03e-13 ***
## factor(home.type)Townhouse     -0.380892   0.060702  -6.275 3.65e-10 ***
## zip_f98043                     -0.290287   0.153092  -1.896 0.057968 .  
## zip_f98101                      2.597296   0.127322  20.399  < 2e-16 ***
## zip_f98102                      1.443823   0.123859  11.657  < 2e-16 ***
## zip_f98103                      0.831240   0.121484   6.842 8.26e-12 ***
## zip_f98104                      1.494876   0.136562  10.946  < 2e-16 ***
## zip_f98105                      0.892983   0.123628   7.223 5.46e-13 ***
## zip_f98107                      0.814744   0.125881   6.472 1.01e-10 ***
## zip_f98109                      1.184760   0.123920   9.561  < 2e-16 ***
## zip_f98112                      1.352922   0.123305  10.972  < 2e-16 ***
## zip_f98115                      0.646404   0.121222   5.332 9.91e-08 ***
## zip_f98117                      0.612297   0.122967   4.979 6.49e-07 ***
## zip_f98119                      1.204477   0.123767   9.732  < 2e-16 ***
## zip_f98121                      2.286939   0.127093  17.994  < 2e-16 ***
## zip_f98122                      0.933487   0.122718   7.607 3.07e-14 ***
## zip_f98125                      0.173769   0.120986   1.436 0.150956    
## zip_f98133                      0.146861   0.121335   1.210 0.226164    
## zip_f98144                      0.686992   0.128858   5.331 9.96e-08 ***
## zip_f98155                     -0.038474   0.127149  -0.303 0.762207    
## zip_f98177                      0.422386   0.127296   3.318 0.000909 ***
## beds                           -0.027558   0.008172  -3.372 0.000748 ***
## baths                           0.034910   0.008903   3.921 8.87e-05 ***
## sqft                            0.784003   0.013799  56.815  < 2e-16 ***
## log_lotsize                    -0.022184   0.015152  -1.464 0.143214    
## year.built                      0.250440   0.018898  13.252  < 2e-16 ***
## sale.date.int                   0.257464   0.012053  21.361  < 2e-16 ***
## parking.spots                   0.012624   0.009226   1.368 0.171276    
## factor(parking.type)Garage      0.006784   0.017304   0.392 0.695006    
## single_sqft                    -0.534494   0.015992 -33.423  < 2e-16 ***
## th_sqft                        -0.382604   0.022735 -16.829  < 2e-16 ***
## single_yearb                   -0.259158   0.022112 -11.720  < 2e-16 ***
## th_yearb                        0.068235   0.047870   1.425 0.154068    
## single_saledate                -0.022775   0.014214  -1.602 0.109119    
## th_saledate                     0.048947   0.017654   2.773 0.005572 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6026 on 9803 degrees of freedom
## Multiple R-squared:  0.6382, Adjusted R-squared:  0.6369 
## F-statistic:   494 on 35 and 9803 DF,  p-value: < 2.2e-16
op = par(mfrow = c(2,2))
plot(mod_lm1)

par(op)

From the 1st and the 3rd plots it appears the variance is not homogeneous. The 2nd plots shows that residual is not normal. The 4th plot tells us that there are no quite influential data points so we are okay.

qqPlot(mod_lm1, main="QQ Plot") #qq plot for studentized resid 

Again the residual is not normally distributed.

spreadLevelPlot(mod_lm1)

## 
## Suggested power transformation:  0.8234643

The residual variance increases as the mean values increase, which is reflected by a red line with a positive slope/trend. We can try to use transformed outcome variable (either square root or log transformation).

crPlots(mod_lm1)

We see there are nonlinear trends between the outcome variable with sqft, year.built, sale.date.int and log_lotsize. Given the shapes of nonlinearity, we can transform sqft and use polynomials for the other two variables. We don’t have much to do with log_lotsize, because a great proportion of observations for this variable were imputed.

Then we test the performance of this naive model.

#Normalize continuous variables for testing data. Note testing data use the same center and scale as training data.

testing = property_sub[-inTrain,]
testing[,c("beds","baths","year.built","parking.spots","sqft","log_lotsize","sale.date.int")] = 
    scale(testing[,c("beds","baths","year.built","parking.spots","sqft","log_lotsize","sale.date.int")],
          center = attr(train_scale,'scaled:center'),
          scale = attr(train_scale,'scaled:scale'))
y_test = scale(testing$last.sale.price,
               center = attr(y_train,'scaled:center'),
               scale = attr(y_train,'scaled:scale'))

testing <- mutate(testing,single_sqft = sqft*(home.type=="Single Family"),
                       th_sqft = sqft*(home.type=="Townhouse"),
                       single_yearb = year.built*(home.type=="Single Family"),
                       th_yearb = year.built*(home.type=="Townhouse"),
                       single_saledate = sale.date.int*(home.type=="Single Family"),
                       th_saledate = sale.date.int*(home.type=="Townhouse"))


y_pred_lm = predict(mod_lm1,newdata = testing)

mse_lm = sum((y_pred_lm-y_test)^2)/length(y_test)

plot(y_test,y_pred_lm,xlab="True normalized price",ylab="Predicted normalized price",
     main="Prediction using naive linear regression")
text(4,-1,substitute(r^2 == r2,list(r2=cor(y_test,y_pred_lm))),adj=0)
text(4,-1.5,substitute(MSE == r2,list(r2=mse_lm)),adj=0)
abline(0,1)

Make transformed variables.

property_sub = property_sub %>% 
    mutate(sqrt_sqft = sqrt(sqft),sqrt_price=sqrt(last.sale.price),log_price=log(last.sale.price))
training = property_sub[inTrain,]
train_scale = scale(training[,c("beds","baths","year.built","parking.spots","sqrt_sqft","log_lotsize","sale.date.int")])
training[,c("beds","baths","year.built","parking.spots","sqrt_sqft","log_lotsize","sale.date.int")] <- train_scale

training <- mutate(training,single_sqft = sqrt_sqft*(home.type=="Single Family"),
                       th_sqft = sqrt_sqft*(home.type=="Townhouse"),
                       single_yearb = year.built*(home.type=="Single Family"),
                       th_yearb = year.built*(home.type=="Townhouse"),
                       single_yearb2 = year.built^2*(home.type=="Single Family"),
                       th_yearb2 = year.built^2*(home.type=="Townhouse"),
                       single_saledate = sale.date.int*(home.type=="Single Family"),
                       th_saledate = sale.date.int*(home.type=="Townhouse"),
                       single_saledate2 = sale.date.int^2*(home.type=="Single Family"),
                       th_saledate2 = sale.date.int^2*(home.type=="Townhouse"))

The 2nd model.

Test the updated explanatory variables. Try square root transformed outcome variable.

y_train = scale(training$sqrt_price)
mod_lm2= lm(y_train~factor(home.type)+zip_f+beds+baths+sqrt_sqft+log_lotsize+year.built+I(year.built^2)+sale.date.int+I(sale.date.int^2)+parking.spots+factor(parking.type)+single_sqft+th_sqft+single_yearb+th_yearb+single_saledate+th_saledate+single_yearb2+th_yearb2+single_saledate2+th_saledate2,data = training)
summary(mod_lm2)
## 
## Call:
## lm(formula = y_train ~ factor(home.type) + zip_f + beds + baths + 
##     sqrt_sqft + log_lotsize + year.built + I(year.built^2) + 
##     sale.date.int + I(sale.date.int^2) + parking.spots + factor(parking.type) + 
##     single_sqft + th_sqft + single_yearb + th_yearb + single_saledate + 
##     th_saledate + single_yearb2 + th_yearb2 + single_saledate2 + 
##     th_saledate2, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1762 -0.3195 -0.0122  0.2881  3.6439 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -1.447592   0.117558 -12.314  < 2e-16 ***
## factor(home.type)Single Family  0.641675   0.035253  18.202  < 2e-16 ***
## factor(home.type)Townhouse     -0.233678   0.062696  -3.727 0.000195 ***
## zip_f98043                     -0.345120   0.142201  -2.427 0.015243 *  
## zip_f98101                      2.469077   0.118217  20.886  < 2e-16 ***
## zip_f98102                      1.586292   0.115008  13.793  < 2e-16 ***
## zip_f98103                      0.958805   0.112814   8.499  < 2e-16 ***
## zip_f98104                      1.439073   0.127329  11.302  < 2e-16 ***
## zip_f98105                      1.043115   0.114813   9.085  < 2e-16 ***
## zip_f98107                      0.959450   0.116894   8.208 2.54e-16 ***
## zip_f98109                      1.361063   0.115071  11.828  < 2e-16 ***
## zip_f98112                      1.507362   0.114510  13.164  < 2e-16 ***
## zip_f98115                      0.764960   0.112629   6.792 1.17e-11 ***
## zip_f98117                      0.730921   0.114193   6.401 1.62e-10 ***
## zip_f98119                      1.347182   0.114925  11.722  < 2e-16 ***
## zip_f98121                      2.299109   0.118073  19.472  < 2e-16 ***
## zip_f98122                      1.027653   0.113994   9.015  < 2e-16 ***
## zip_f98125                      0.206705   0.112447   1.838 0.066056 .  
## zip_f98133                      0.139894   0.112755   1.241 0.214748    
## zip_f98144                      0.732895   0.119707   6.122 9.57e-10 ***
## zip_f98155                     -0.079890   0.118262  -0.676 0.499356    
## zip_f98177                      0.490565   0.118432   4.142 3.47e-05 ***
## beds                           -0.023601   0.007646  -3.087 0.002028 ** 
## baths                           0.061434   0.008478   7.246 4.62e-13 ***
## sqrt_sqft                       0.668780   0.012700  52.658  < 2e-16 ***
## log_lotsize                     0.020895   0.014541   1.437 0.150753    
## year.built                      0.235724   0.017583  13.407  < 2e-16 ***
## I(year.built^2)                 0.298103   0.016470  18.100  < 2e-16 ***
## sale.date.int                   0.298438   0.011912  25.053  < 2e-16 ***
## I(sale.date.int^2)              0.073248   0.010409   7.037 2.10e-12 ***
## parking.spots                   0.028007   0.008639   3.242 0.001191 ** 
## factor(parking.type)Garage      0.049652   0.016215   3.062 0.002204 ** 
## single_sqft                    -0.410965   0.014887 -27.605  < 2e-16 ***
## th_sqft                        -0.294038   0.021601 -13.613  < 2e-16 ***
## single_yearb                   -0.241869   0.022732 -10.640  < 2e-16 ***
## th_yearb                       -0.182096   0.051445  -3.540 0.000403 ***
## single_saledate                -0.022013   0.014302  -1.539 0.123798    
## th_saledate                     0.029953   0.017704   1.692 0.090710 .  
## single_yearb2                  -0.283840   0.019782 -14.348  < 2e-16 ***
## th_yearb2                       0.156439   0.044498   3.516 0.000441 ***
## single_saledate2               -0.002811   0.013295  -0.211 0.832549    
## th_saledate2                    0.003742   0.016470   0.227 0.820272    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5595 on 9797 degrees of freedom
## Multiple R-squared:  0.6882, Adjusted R-squared:  0.6869 
## F-statistic: 527.5 on 41 and 9797 DF,  p-value: < 2.2e-16
op = par(mfrow = c(2,2))
plot(mod_lm2)

par(op)

The residual variance is better compared to the previous model, but is still heterogeneous.

spreadLevelPlot(mod_lm2)

## 
## Suggested power transformation:  0.8984082
crPlots(mod_lm2)

There is still nonlinearity.

The 3rd model.

Try log transformed outcome variable.

y_train = scale(training$log_price)
mod_lm3= lm(y_train~factor(home.type)+zip_f+beds+baths+sqrt_sqft+log_lotsize+year.built+I(year.built^2)+sale.date.int+I(sale.date.int^2)+parking.spots+factor(parking.type)+single_sqft+th_sqft+single_yearb+th_yearb+single_saledate+th_saledate+single_yearb2+th_yearb2+single_saledate2+th_saledate2,data = training)
summary(mod_lm3)
## 
## Call:
## lm(formula = y_train ~ factor(home.type) + zip_f + beds + baths + 
##     sqrt_sqft + log_lotsize + year.built + I(year.built^2) + 
##     sale.date.int + I(sale.date.int^2) + parking.spots + factor(parking.type) + 
##     single_sqft + th_sqft + single_yearb + th_yearb + single_saledate + 
##     th_saledate + single_yearb2 + th_yearb2 + single_saledate2 + 
##     th_saledate2, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8378 -0.3078  0.0155  0.3225  3.0200 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -1.582961   0.117273 -13.498  < 2e-16 ***
## factor(home.type)Single Family  0.819960   0.035167  23.316  < 2e-16 ***
## factor(home.type)Townhouse     -0.105759   0.062544  -1.691  0.09088 .  
## zip_f98043                     -0.354588   0.141856  -2.500  0.01245 *  
## zip_f98101                      2.250010   0.117930  19.079  < 2e-16 ***
## zip_f98102                      1.650969   0.114729  14.390  < 2e-16 ***
## zip_f98103                      1.008675   0.112541   8.963  < 2e-16 ***
## zip_f98104                      1.493065   0.127020  11.755  < 2e-16 ***
## zip_f98105                      1.099512   0.114535   9.600  < 2e-16 ***
## zip_f98107                      1.022400   0.116610   8.768  < 2e-16 ***
## zip_f98109                      1.448172   0.114792  12.616  < 2e-16 ***
## zip_f98112                      1.535062   0.114232  13.438  < 2e-16 ***
## zip_f98115                      0.777795   0.112355   6.923 4.71e-12 ***
## zip_f98117                      0.775766   0.113916   6.810 1.03e-11 ***
## zip_f98119                      1.391110   0.114646  12.134  < 2e-16 ***
## zip_f98121                      2.239273   0.117787  19.011  < 2e-16 ***
## zip_f98122                      1.073223   0.113717   9.438  < 2e-16 ***
## zip_f98125                      0.166449   0.112174   1.484  0.13788    
## zip_f98133                      0.056713   0.112481   0.504  0.61413    
## zip_f98144                      0.758124   0.119417   6.349 2.27e-10 ***
## zip_f98155                     -0.181391   0.117975  -1.538  0.12419    
## zip_f98177                      0.491341   0.118144   4.159 3.23e-05 ***
## beds                           -0.012920   0.007627  -1.694  0.09030 .  
## baths                           0.079520   0.008458   9.402  < 2e-16 ***
## sqrt_sqft                       0.626955   0.012670  49.485  < 2e-16 ***
## log_lotsize                     0.038600   0.014506   2.661  0.00780 ** 
## year.built                      0.215644   0.017540  12.294  < 2e-16 ***
## I(year.built^2)                 0.318308   0.016430  19.374  < 2e-16 ***
## sale.date.int                   0.314091   0.011883  26.431  < 2e-16 ***
## I(sale.date.int^2)              0.083347   0.010384   8.026 1.12e-15 ***
## parking.spots                   0.025562   0.008618   2.966  0.00302 ** 
## factor(parking.type)Garage      0.075649   0.016176   4.677 2.95e-06 ***
## single_sqft                    -0.379180   0.014851 -25.532  < 2e-16 ***
## th_sqft                        -0.278899   0.021548 -12.943  < 2e-16 ***
## single_yearb                   -0.240134   0.022677 -10.589  < 2e-16 ***
## th_yearb                       -0.138682   0.051320  -2.702  0.00690 ** 
## single_saledate                -0.029509   0.014267  -2.068  0.03864 *  
## th_saledate                     0.008770   0.017661   0.497  0.61949    
## single_yearb2                  -0.328192   0.019734 -16.630  < 2e-16 ***
## th_yearb2                       0.137642   0.044390   3.101  0.00194 ** 
## single_saledate2               -0.018470   0.013263  -1.393  0.16378    
## th_saledate2                   -0.018655   0.016430  -1.135  0.25621    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5582 on 9797 degrees of freedom
## Multiple R-squared:  0.6898, Adjusted R-squared:  0.6885 
## F-statistic: 531.3 on 41 and 9797 DF,  p-value: < 2.2e-16
op = par(mfrow = c(2,2))
plot(mod_lm3)

par(op)

Now the residual variance appears to be relatively constant.

spreadLevelPlot(mod_lm3)

## 
## Suggested power transformation:  0.9742098
crPlots(mod_lm3)

hist(residuals(mod_lm3))

All variable appear to have linear relationship. Now everything seems to be okay, probably except for the assumption of normality of the residuals. However, this deviation from normality caused by the heavy tails is relatively acceptable.

We may think of dropping the interactions between property type and the polynomials of sale date since they are not that significant.

The 4th model.

Drop the interactions between property type and the polynomials of sale date.

mod_lm4 = update(mod_lm3,.~.-single_saledate-th_saledate-single_yearb2-th_yearb2-single_saledate2-th_saledate2)
summary(mod_lm4)
## 
## Call:
## lm(formula = y_train ~ factor(home.type) + zip_f + beds + baths + 
##     sqrt_sqft + log_lotsize + year.built + I(year.built^2) + 
##     sale.date.int + I(sale.date.int^2) + parking.spots + factor(parking.type) + 
##     single_sqft + th_sqft + single_yearb + th_yearb, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7778 -0.3118  0.0211  0.3294  3.1602 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -1.4576612  0.1187851 -12.271  < 2e-16 ***
## factor(home.type)Single Family  0.6069790  0.0310559  19.545  < 2e-16 ***
## factor(home.type)Townhouse     -0.0522211  0.0579697  -0.901   0.3677    
## zip_f98043                     -0.3275019  0.1443425  -2.269   0.0233 *  
## zip_f98101                      2.2825436  0.1200048  19.020  < 2e-16 ***
## zip_f98102                      1.6698454  0.1167753  14.300  < 2e-16 ***
## zip_f98103                      0.9947526  0.1145377   8.685  < 2e-16 ***
## zip_f98104                      1.6523231  0.1289208  12.817  < 2e-16 ***
## zip_f98105                      1.0956247  0.1165730   9.399  < 2e-16 ***
## zip_f98107                      1.0334297  0.1186798   8.708  < 2e-16 ***
## zip_f98109                      1.4385161  0.1168369  12.312  < 2e-16 ***
## zip_f98112                      1.5171898  0.1162532  13.051  < 2e-16 ***
## zip_f98115                      0.7907614  0.1143560   6.915 4.97e-12 ***
## zip_f98117                      0.7776796  0.1159517   6.707 2.10e-11 ***
## zip_f98119                      1.3816157  0.1166804  11.841  < 2e-16 ***
## zip_f98121                      2.2732885  0.1198519  18.967  < 2e-16 ***
## zip_f98122                      1.0562388  0.1157204   9.128  < 2e-16 ***
## zip_f98125                      0.1876836  0.1141706   1.644   0.1002    
## zip_f98133                      0.0784270  0.1144848   0.685   0.4933    
## zip_f98144                      0.7336307  0.1215223   6.037 1.63e-09 ***
## zip_f98155                     -0.1255348  0.1200463  -1.046   0.2957    
## zip_f98177                      0.5585016  0.1201921   4.647 3.42e-06 ***
## beds                            0.0007117  0.0077209   0.092   0.9266    
## baths                           0.0553488  0.0085008   6.511 7.83e-11 ***
## sqrt_sqft                       0.6644254  0.0126360  52.582  < 2e-16 ***
## log_lotsize                     0.0592193  0.0146260   4.049 5.19e-05 ***
## year.built                      0.2329190  0.0177956  13.089  < 2e-16 ***
## I(year.built^2)                 0.1189599  0.0099222  11.989  < 2e-16 ***
## sale.date.int                   0.3026725  0.0073507  41.176  < 2e-16 ***
## I(sale.date.int^2)              0.0710626  0.0059503  11.943  < 2e-16 ***
## parking.spots                   0.0165312  0.0087415   1.891   0.0586 .  
## factor(parking.type)Garage      0.0831364  0.0163377   5.089 3.67e-07 ***
## single_sqft                    -0.4236719  0.0148532 -28.524  < 2e-16 ***
## th_sqft                        -0.2806124  0.0214869 -13.060  < 2e-16 ***
## single_yearb                   -0.1393480  0.0223775  -6.227 4.95e-10 ***
## th_yearb                        0.0842578  0.0454266   1.855   0.0637 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5682 on 9803 degrees of freedom
## Multiple R-squared:  0.6783, Adjusted R-squared:  0.6772 
## F-statistic: 590.6 on 35 and 9803 DF,  p-value: < 2.2e-16
AIC(mod_lm3)
## [1] 16491.25
AIC(mod_lm4)
## [1] 16835.45
anova(mod_lm3,mod_lm4)
## Analysis of Variance Table
## 
## Model 1: y_train ~ factor(home.type) + zip_f + beds + baths + sqrt_sqft + 
##     log_lotsize + year.built + I(year.built^2) + sale.date.int + 
##     I(sale.date.int^2) + parking.spots + factor(parking.type) + 
##     single_sqft + th_sqft + single_yearb + th_yearb + single_saledate + 
##     th_saledate + single_yearb2 + th_yearb2 + single_saledate2 + 
##     th_saledate2
## Model 2: y_train ~ factor(home.type) + zip_f + beds + baths + sqrt_sqft + 
##     log_lotsize + year.built + I(year.built^2) + sale.date.int + 
##     I(sale.date.int^2) + parking.spots + factor(parking.type) + 
##     single_sqft + th_sqft + single_yearb + th_yearb
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   9797 3052.2                                  
## 2   9803 3164.7 -6   -112.52 60.196 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since model 3 has remarkably smaller AIC and ANOVA shows significant difference, we will not drop those interaction terms.

vif(mod_lm3)    #test for collinearity
##                            GVIF Df GVIF^(1/(2*Df))
## factor(home.type)    109.749257  2        3.236685
## zip_f                  3.352210 19        1.032344
## beds                   1.836947  1        1.355340
## baths                  2.258940  1        1.502977
## sqrt_sqft              5.069024  1        2.251449
## log_lotsize            6.644626  1        2.577717
## year.built             9.715114  1        3.116908
## I(year.built^2)        6.305897  1        2.511154
## sale.date.int          4.459425  1        2.111735
## I(sale.date.int^2)     3.810838  1        1.952137
## parking.spots          2.345446  1        1.531485
## factor(parking.type)   1.973412  1        1.404782
## single_sqft            3.658474  1        1.912714
## th_sqft                1.757016  1        1.325525
## single_yearb           6.911675  1        2.629006
## th_yearb              16.162303  1        4.020237
## single_saledate        3.139390  1        1.771832
## th_saledate            1.985659  1        1.409134
## single_yearb2         10.480757  1        3.237400
## th_yearb2             15.523755  1        3.940020
## single_saledate2       4.430656  1        2.104912
## th_saledate2           3.244182  1        1.801161

There does not appear to be any problem of collinearity.

Then we test the performance of this improved model.

#Normalize continuous variables for testing data. Note testing data use the same center and scale as training data.

testing = property_sub[-inTrain,]
testing[,c("beds","baths","year.built","parking.spots","sqrt_sqft","log_lotsize","sale.date.int")] = 
    scale(testing[,c("beds","baths","year.built","parking.spots","sqrt_sqft","log_lotsize","sale.date.int")],
          center = attr(train_scale,'scaled:center'),
          scale = attr(train_scale,'scaled:scale'))
y_test = scale(testing$log_price,
               center = attr(y_train,'scaled:center'),
               scale = attr(y_train,'scaled:scale'))

testing <- mutate(testing,single_sqft = sqrt_sqft*(home.type=="Single Family"),
                       th_sqft = sqrt_sqft*(home.type=="Townhouse"),
                       single_yearb = year.built*(home.type=="Single Family"),
                       th_yearb = year.built*(home.type=="Townhouse"),
                       single_yearb2 = year.built^2*(home.type=="Single Family"),
                       th_yearb2 = year.built^2*(home.type=="Townhouse"),
                       single_saledate = sale.date.int*(home.type=="Single Family"),
                       th_saledate = sale.date.int*(home.type=="Townhouse"),
                       single_saledate2 = sale.date.int^2*(home.type=="Single Family"),
                       th_saledate2 = sale.date.int^2*(home.type=="Townhouse"))

y_pred_lm = predict(mod_lm3,newdata = testing)

mse_lm = sum((y_pred_lm-y_test)^2)/length(y_test)

plot(y_test,y_pred_lm,xlab="True normalized price",ylab="Predicted normalized price",
     main="Prediction using improved linear model")
text(-5,2,substitute(r^2 == r2,list(r2=cor(y_test,y_pred_lm))),adj=0)
text(-5,1.7,substitute(MSE == r2,list(r2=mse_lm)),adj=0)
abline(0,1)

Prediction

Now we can use this model to answer questions about property prices in areas around Seattle. Say we are interested in properties with specifics as following

We can predict the expected prices by property type and location(zip code).

#== prediction ==#

zip_grid = unique(property_sub$zip_f)
home_grid = c('Single Family','Townhouse','Condo/Coop')
date_grid = as.integer(as.POSIXct(c('2010-1-1','2011-1-1','2012-1-1','2013-1-1','2014-1-1','2015-1-1','2016-1-1'))-as.POSIXct('2009-1-1'))
bed_grid = 3
bath_grid = 2
yearb_grid = 2000
sqft_grid = sqrt(1200)
lot_grid = median(property_sub$log_lotsize)
parking_grid = median(property_sub$parking.spots)
parking_type = 'Garage'

par_mat = expand.grid(zip_grid,home_grid,date_grid,bed_grid,bath_grid,yearb_grid,sqft_grid,lot_grid,parking_grid,parking_type)
names(par_mat) = c("zip_f","home.type","sale.date.int","beds","baths","year.built","sqrt_sqft","log_lotsize","parking.spots","parking.type")

par_mat_std = par_mat
par_mat_std[,c("beds","baths","year.built","parking.spots","sqrt_sqft","log_lotsize","sale.date.int")] = 
    scale(par_mat_std[,c("beds","baths","year.built","parking.spots","sqrt_sqft","log_lotsize","sale.date.int")],
          center = attr(train_scale,'scaled:center'),
          scale = attr(train_scale,'scaled:scale'))

par_mat_std <- mutate(par_mat_std,single_sqft = sqrt_sqft*(home.type=="Single Family"),
                       th_sqft = sqrt_sqft*(home.type=="Townhouse"),
                       single_yearb = year.built*(home.type=="Single Family"),
                       th_yearb = year.built*(home.type=="Townhouse"),
                       single_yearb2 = year.built^2*(home.type=="Single Family"),
                       th_yearb2 = year.built^2*(home.type=="Townhouse"),
                       single_saledate = sale.date.int*(home.type=="Single Family"),
                       th_saledate = sale.date.int*(home.type=="Townhouse"),
                       single_saledate2 = sale.date.int^2*(home.type=="Single Family"),
                       th_saledate2 = sale.date.int^2*(home.type=="Townhouse"))

query_price = exp(predict(mod_lm3,par_mat_std)*attr(y_train,'scaled:scale')+attr(y_train,'scaled:center'))

final_res = cbind(par_mat,query_price)


#== long & lati with zip ==#
zip_longlat <- property_sub %>% group_by(zip_f) %>%
    summarise(lat = mean(latitude), long = mean(longitude))

final_res = merge(final_res,zip_longlat,by='zip_f')


#== plot map ==#

library(ggmap)

price_range = range(query_price)

final_res$shift = 0.01*as.integer(final_res$home.type)-0.015

map <- get_map(c(-122.4,47.7), zoom = 11,maptype="roadmap")

Question 0: Is the property price increasing in recent years?

Before we can ask any real question, let’s confirm on this belief: the price IS INCREASING.

mod_lm5= update(mod_lm3,.~.-sale.date.int-I(sale.date.int^2))

plot(training$sale.date.int,resid(mod_lm5))
abline(lm(resid(mod_lm5)~training$sale.date.int),col='green',lwd=2)
lines(lowess(resid(mod_lm5)~training$sale.date.int),col='red',lwd=2)

This is the trend after adjusting for other variables, and we are looking at the relationship between the price (residuals, technically speaking) and property sale date. Although the unit has been scaled, we can still observe a positive trend. In fact, if we compare the locally weighted smoothing line in red against a linear fitting in green, we see this price increase is even faster than a linear growth. However, the red line is nearly flat in the first 1/3 period of time within the investigated time interval (2010-08-06 to 2016-01-27). This tells us this speedup in the price growth rate happened only recently.

Question 1: What are the expected prices like in 2010?

ggmap(map) + 
    geom_point(aes(x = long+shift, y = lat, fill = query_price,group=home.type,shape=home.type),data = subset(final_res,sale.date.int==as.POSIXct('2010-1-1')-as.POSIXct('2009-1-1')),
               alpha = .8,colour='black',size=4)+
    scale_shape_manual(values = c(21:23),labels=c("Single Family","Townhouse","Condo")) + 
    scale_fill_gradientn(colours = topo.colors(10),limits=price_range,
                         labels=c("200k","400k","600k","800k","1M")) +
    ggtitle("Predicted property price on 2010-1-1") +
    xlab("Longitude") +
    ylab("Latitude") +
    theme(plot.title = element_text(lineheight=1, face="bold",size=20),
          axis.text=element_text(size=10),axis.title=element_text(size=15),
          legend.position=c(0,1),legend.justification=c(0,1),
          legend.title=element_text(size=15),
          legend.text=element_text(size=15),
          legend.key.height=unit(2,"line"),
          legend.box = "horizontal",
          legend.box.just = "top")

We see a price distribution as expected, with the properties in the downtown area the most expensive and those getting cheaper as we move further north.

Question 2: What are the expected prices like in 2016?

ggmap(map) + 
    geom_point(aes(x = long+shift, y = lat, fill = query_price,group=home.type,shape=home.type),data = subset(final_res,sale.date.int==as.POSIXct('2016-1-1')-as.POSIXct('2009-1-1')),
               alpha = .8,colour='black',size=4)+
    scale_shape_manual(values = c(21:23),labels=c("Single Family","Townhouse","Condo")) + 
    scale_fill_gradientn(colours = topo.colors(10),limits=price_range,
                         labels=c("200k","400k","600k","800k","1M")) +
    ggtitle("Predicted property price on 2016-1-1") +
    xlab("Longitude") +
    ylab("Latitude") +
    theme(plot.title = element_text(lineheight=1, face="bold",size=20),
          axis.text=element_text(size=10),axis.title=element_text(size=15),
          legend.position=c(0,1),legend.justification=c(0,1),
          legend.title=element_text(size=15),
          legend.text=element_text(size=15),
          legend.key.height=unit(2,"line"),
          legend.box = "horizontal",
          legend.box.just = "top")

Question 3: How are the expected prices changed bewteen 2015 and 2016?

price_year_chg = final_res %>%
    select(zip_f,home.type,sale.date.int,shift,query_price,long,lat) %>%
    filter(sale.date.int %in% as.integer(as.POSIXct(c('2016-1-1','2015-1-1'))-as.POSIXct('2009-1-1'))) %>%
    arrange(zip_f,home.type,sale.date.int) %>%
    mutate(price_lag = lag(query_price),diff_price = (query_price-price_lag), diff_price_prop = diff_price/price_lag) %>%
    filter(sale.date.int == as.POSIXct('2016-1-1')-as.POSIXct('2009-1-1'))

price_chg_range = range(price_year_chg$diff_price)

ggmap(map) + 
    geom_point(aes(x = long+shift, y = lat, fill = diff_price,group=home.type,shape=home.type),
               data = price_year_chg,alpha = .8,colour='black',size=4)+
    scale_shape_manual(values = c(21:23),labels=c("Single Family","Townhouse","Condo")) + 
    scale_fill_gradientn(colours = topo.colors(10),limits=price_chg_range,name="Price_diff") +
    ggtitle("Price change between 2015-1-1 and 2016-1-1") +
    xlab("Longitude") +
    ylab("Latitude") +
    theme(plot.title = element_text(lineheight=1, face="bold",size=20),
          axis.text=element_text(size=10),axis.title=element_text(size=15),
          legend.position=c(0,1),legend.justification=c(0,1),
          legend.title=element_text(size=15),
          legend.text=element_text(size=15),
          legend.key.height=unit(2,"line"),
          legend.box = "horizontal",
          legend.box.just = "top")

Question 4: the percentage of the price change in Question 3?

price_chg_range = range(price_year_chg$diff_price_prop*100)

ggmap(map) + 
    geom_point(aes(x = long+shift, y = lat, fill = diff_price_prop*100,group=home.type,shape=home.type),
               data = price_year_chg,alpha = .8,colour='black',size=4)+
    scale_shape_manual(values = c(21:23),labels=c("Single Family","Townhouse","Condo")) + 
    scale_fill_gradientn(colours = topo.colors(10),limits=price_chg_range,name="Price_diff %") +
    ggtitle("Price change percentage between 2015-1-1 and 2016-1-1") +
    xlab("Longitude") +
    ylab("Latitude") +
    theme(plot.title = element_text(lineheight=1, face="bold",size=20),
          axis.text=element_text(size=10),axis.title=element_text(size=15),
          legend.position=c(0,1),legend.justification=c(0,1),
          legend.title=element_text(size=15),
          legend.text=element_text(size=15),
          legend.key.height=unit(2,"line"),
          legend.box = "horizontal",
          legend.box.just = "top")

This is less informative in terms of the lack of difference in all areas. This is due to not including the interaction between zip code and sale date.

Now we look at the annual price change percentage.

price_year_chg = final_res %>%
    select(zip_f,home.type,sale.date.int,shift,query_price,long,lat) %>%
    arrange(zip_f,home.type,sale.date.int) %>%
    mutate(price_lag = lag(query_price),diff_price = (query_price-price_lag), diff_price_prop = diff_price/price_lag) %>%
    filter(zip_f==98036,sale.date.int!=365)

library(scales)    #for percent in the next plot
xlab_date <- function(x){
    as.Date('2009-1-1')+x
}

ggplot(aes(x=sale.date.int,y=diff_price_prop,group=home.type,color=home.type),data=price_year_chg) +
    geom_point() +
    geom_line() +
    ggtitle("Predicted annual price change") +
    scale_x_continuous(name="Sale date",labels = xlab_date,breaks = unique(price_year_chg$sale.date.int)) +
    scale_y_continuous(name="Price_diff %",labels = percent) +
    theme(plot.title = element_text(lineheight=1, face="bold",size=20),
          axis.text=element_text(size=10),axis.title=element_text(size=15),
          legend.position=c(0,1),legend.justification=c(0,1),
          legend.title=element_text(size=15),
          legend.text=element_text(size=15),
          legend.key.height=unit(2,"line"),
          legend.box = "horizontal",
          legend.box.just = "top")

We see that despite of the fact that condos are the cheapest properties with the same property parameters, their price growth rate was the lowest before 2013, and exceeded that of townhouses and eventually single family houses in 2014~2015. Note that the linearity in increasing growth rate comes from the 2st degree polynomial in our ‘mod_lm3’ model.

Concluding remarks