Outline:

  1. Problem Statement
  2. Feature Engineering
  3. Exploratory Data Analysis
  4. Linear Regression
  5. Gradient Boosting
  6. Conclusion
  7. Next Steps

1. Problem Statement

Home prices are a popular and well-discussed topic, not only among home buyers, but also among investors, real-estate agencies, online listing platforms, etc. A traditional way to predict home prices is by analyzing house characteristics, such as location, square footage, year built, building materials, etc. However, with increasing data availability, more complex and accurate models utilizing unconventional datasets or modeling techniques are emerging. This project aims to explore the relationship between home prices and various weather observations to better predict home prices.

2. Feature Engineering

Load repository paths and functions

path<-'/Users/xucao/Google Drive/MyWorkStation/Projects/ODG-Weather-HousingPrice/'
dataloc<-paste0(path,'DATA/')
codeloc<-paste0(path,'CODE/')
output <-paste0(path,'OUTPUT/')
source(paste0(codeloc,"functions.r"))
source(paste0(codeloc,"attr.bivar.R"))             #self defined bivariate plot function

Read in text files

read in housing data
housing_data<-fread(paste0(dataloc,'housing_data_ACS_15_5YR_DP04_with_ann.csv'),header=TRUE)
#rename zip code column
colnames(housing_data)[2]<-'ZIP'
#convert zip code column to numeric
housing_data$ZIP=as.numeric((housing_data$ZIP))
#drop the top row with variables descriptions
housing_data<-housing_data[2:nrow(housing_data),]
read in geographic data
Gaz_zcta_national<-fread(paste0(dataloc,'2015_Gaz_zcta_national.txt'),header=TRUE)
colnames(Gaz_zcta_national)[1]<-'ZIP'
#convert integer64 columns to numeric 
Gaz_zcta_national$ALAND<-as.numeric(Gaz_zcta_national$ALAND)
Gaz_zcta_national$AWATER<-as.numeric(Gaz_zcta_national$AWATER)
read in weather station data
weather_allstations<-read_fwf(file=paste0(dataloc,'weather_allstations.txt'),fwf_widths(c(12,9,10,7,3,31,4,4,6)))
#renaming columns
names(weather_allstations)<-c('STATION_ID','LAT','LONG','ELEVATION','STATE','STATION_NAME','SOURCE1','SOURCE2','ZIP_STATION')
#subset useful variables
weather_allstations<-weather_allstations[,c('STATION_ID','ELEVATION','STATE')]
read in zipcode to weather station mapping table
weather_zipcodes_stations<-read_fwf(file=paste0(dataloc,'weather_zipcodes-normals-stations.txt'),fwf_widths(c(12,6,50)))
#Renaming columns
names(weather_zipcodes_stations)<-c('STATION_ID','ZIP','CITY')
#Formatting column 'CITY' and 'ZIP', all 'ZIP' columns are converted to numeric for matching purposes
weather_zipcodes_stations$CITY<-toupper(weather_zipcodes_stations$CITY)
weather_zipcodes_stations$ZIP<-as.numeric(weather_zipcodes_stations$ZIP)
#Create primary key
weather_zipcodes_stations$STATION<-paste0(formatC(weather_zipcodes_stations$ZIP, width=5, flag="0"),toupper(weather_zipcodes_stations$CITY))
read in weather data
namelist<-c('STATION_ID','Weat_Jan','Weat_Jan_Type','Weat_Feb','Weat_Feb_Type','Weat_Mar','Weat_Mar_Type','Weat_Apr','Weat_Apr_Type','Weat_May','Weat_May_Type',
                       'Weat_Jun','Weat_Jun_Type','Weat_Jul','Weat_Jul_Type','Weat_Aug','Weat_Aug_Type','Weat_Sep','Weat_Sep_Type','Weat_Oct','Weat_Oct_Type','Weat_Nov','Weat_Nov_Type',
                       'Weat_Dec','Weat_Dec_Type')

weather_prcp<-read_fwf(file=paste0(dataloc,'weather_mly-prcp-normal.txt'),fwf_widths(c(17,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1)))
weather_tavg<-read_fwf(file=paste0(dataloc,'weather_mly-tavg-normal.txt'),fwf_widths(c(17,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1)))
weather_tmax<-read_fwf(file=paste0(dataloc,'weather_mly-tmax-normal.txt'),fwf_widths(c(17,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1)))
weather_tmin<-read_fwf(file=paste0(dataloc,'weather_mly-tmin-normal.txt'),fwf_widths(c(17,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1)))

names(weather_prcp)<-paste0(namelist,'_prcp')
names(weather_prcp)[1]<-'STATION_ID'
names(weather_tavg)<-paste0(namelist,'_tavg')
names(weather_tavg)[1]<-'STATION_ID'
names(weather_tmax)<-paste0(namelist,'_tmax')
names(weather_tmax)[1]<-'STATION_ID'
names(weather_tmin)<-paste0(namelist,'_tmin')
names(weather_tmin)[1]<-'STATION_ID'

#precipitation cannot be negative, so we convert negative values to NA
for (i in 1:ncol(weather_prcp)){
  if (is.numeric(weather_prcp[[i]])){
    weather_prcp[[i]]<-ifelse(weather_prcp[[i]]<0,NA,weather_prcp[[i]])
  }
}


clpcdy15<-separate(read_fwf(file=paste0(dataloc,'clpcdy15.txt'),fwf_widths(c(38,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4)),skip=2),col='X1',into=c('STATION','STATE'),sep=',' )
names(clpcdy15)=c('STATION','STATE','YRS','Jan_CL','Jan_PC','Jan_CD','Feb_CL','Feb_PC','Feb_CD','Mar_CL','Mar_PC','Mar_CD','Apr_CL','Apr_PC','Apr_CD','May_CL','May_PC','May_CD','Jun_CL','Jun_PC','Jun_CD','Jul_CL','Jul_PC','Jul_CD','Aug_CL','Aug_PC','Aug_CD','Sep_CL','Sep_PC','Sep_CD','Oct_CL','Oct_PC','Oct_CD','Nov_CL','Nov_PC','Nov_CD','Dec_CL','Dec_PC','Dec_CD','Ann_CL','Ann_PC','Ann_CD')

#Based on the Master Location Identifier Database downloaded from 'http://www.weathergraphics.com/identifiers/', WBAN is the weather station identifier. The first column of the clpcdy15 and pctpos15 datasets consists of the WBAN and the CITY name.
clpcdy15$WBAN<-as.numeric(substr(clpcdy15$STATION,1,5))
MLID<-fread(file=paste0(dataloc,'MLID.csv'),header=TRUE)
MLID<-MLID%>%select(WBAN, (CITY))
MLID$CITY<-toupper(MLID$CITY)
clpcdy15<-clpcdy15%>%left_join(MLID,by='WBAN')
clpcdy15$CITY<-coalesce(clpcdy15$CITY,substr(clpcdy15$STATION, 6, 50))
clpcdy15$STATION=paste0(clpcdy15$WBAN,clpcdy15$CITY)
   
pctpos15<-separate(read_fwf(file=paste0(dataloc,'pctpos15.txt'),fwf_widths(c(37,16,6,6,6,6,6,6,6,6,6,6,6,6,6)),skip=2),col='X1',into=c('STATION','STATE'),sep=',' )   
names(pctpos15)=c('STATION','STATE','POR','JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC','ANN')
pctpos15$WBAN<-as.numeric(substr(pctpos15$STATION,1,5))
pctpos15$CITY<-substr(pctpos15$STATION, 6, 50)

Check variable data types

sapply(housing_data, class) 
#We only keep the actual values (HC01), and drop other statistics columns, such as margin of error, percentage, etc.
housing_data<-housing_data%>%select(ZIP,starts_with('HC01_'))
#convert to numeric data
housing=as.data.frame(sapply(housing_data, as.numeric))
sapply(Gaz_zcta_national, class) 
sapply(weather_allstations, class) 
weather_allstations$ELEVATION<-as.numeric(weather_allstations$ELEVATION)
sapply(weather_zipcodes_stations, class) 
sapply(weather_prcp, class) 
sapply(weather_tavg, class) 
#convert character temperature data into numeric data
weather_tavg$Weat_Jul_tavg<-as.numeric(weather_tavg$Weat_Jul_tavg)
weather_tavg$Weat_Aug_tavg<-as.numeric(weather_tavg$Weat_Aug_tavg)
sapply(weather_tmax, class) 
#convert character temperature data into numeric data
weather_tmax$Weat_May_tmax<-as.numeric(weather_tmax$Weat_May_tmax)
weather_tmax$Weat_Jun_tmax<-as.numeric(weather_tmax$Weat_Jun_tmax)
weather_tmax$Weat_Jul_tmax<-as.numeric(weather_tmax$Weat_Jul_tmax)
weather_tmax$Weat_Aug_tmax<-as.numeric(weather_tmax$Weat_Aug_tmax)
weather_tmax$Weat_Sep_tmax<-as.numeric(weather_tmax$Weat_Sep_tmax)
sapply(weather_tmin, class) 
sapply(clpcdy15, class) 
clpcdy15<-cbind(clpcdy15[1:3], lapply(clpcdy15[4:42], as.numeric) ,clpcdy15[43:44])
sapply(pctpos15, class) 

Check primary key duplicates in each dataset

clpcdy15$STATION[duplicated(clpcdy15$STATION)]
#dedup on primary key
clpcdy15<-clpcdy15[!duplicated(clpcdy15$STATION),]
pctpos15$STATION[duplicated(pctpos15$STATION)]
pctpos15<-pctpos15[!duplicated(pctpos15$STATION),]
housing$ZIP[duplicated(housing$ZIP)]
Gaz_zcta_national$ZIP[duplicated(Gaz_zcta_national$ZIP)]
weather_allstations$STATION_ID[duplicated(weather_allstations$STATION_ID)]
weather_zipcodes_stations$STATION_ID[duplicated(weather_zipcodes_stations$STATION_ID)]
weather_prcp$STATION_ID[duplicated(weather_prcp$STATION_ID)]
weather_tavg$STATION_ID[duplicated(weather_tavg$STATION_ID)]
weather_tmax$STATION_ID[duplicated(weather_tmax$STATION_ID)]
weather_tmin$STATION_ID[duplicated(weather_tmin$STATION_ID)]

Merge datasets

data<-weather_zipcodes_stations%>%
  left_join(weather_allstations,by='STATION_ID')%>%
  left_join(weather_prcp,by='STATION_ID')%>%
  left_join(weather_tavg,by='STATION_ID')%>%
  left_join(weather_tmax,by='STATION_ID')%>%
  left_join(weather_tmin,by='STATION_ID')%>%
  left_join(housing,by='ZIP')%>%
  left_join(Gaz_zcta_national,by='ZIP')%>%
  left_join(pctpos15,by='STATION')%>%
  left_join(clpcdy15,by='STATION')

Examine target

summary(data$HC01_VC128)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13000   88200  123700  159727  181300 1732700     371
#drop records with NA target; records without target information are not useful for our supervised learning analysis
data<-data[!is.na(data$HC01_VC128),]

#simulate gamma distribution (method of moments: for a gamma, mean=k*theta and variance=k*theta^2, so theta=variance/mean and k=mean/theta)
theta=(sd(log(data$HC01_VC128)))^2/mean(log(data$HC01_VC128))
k=mean(log(data$HC01_VC128))/theta
gamma<-as.data.frame(rgamma(nrow(data),shape=k,scale=theta))
gamma$source<-'gamma'
names(gamma)=c('value','source')
#simulate normal distribution 
mean<-mean(log(data$HC01_VC128))
sd<-sd(log(data$HC01_VC128))
norm<-as.data.frame(rnorm(nrow(data),mean=mean,sd=sd))
norm$source='norm'
names(norm)=c('value','source')
#plot actual dependent variable distribution 
actual<-as.data.frame(log(data$HC01_VC128))
actual$source='actual'
names(actual)=c('value','source')
dist<-rbind(gamma,norm,actual)
dist$value<-as.numeric(dist$value)
ggplot(dist,aes(value, fill=source))+geom_density(alpha=0.5)

#based on the above comparison of the distribution shapes, the log of the target is approximately normal, so a log-normal distribution will be used for the target
data$Target<-log(data$HC01_VC128)

Split Train/Test/Holdout: 50%/25%/25%

This step prepares the data for the modeling work below; we conduct feature engineering and exploratory analysis on the training set only.

set.seed(1)
inTrainingSet = createDataPartition(log(data$HC01_VC128),p=.5, list = FALSE)
dt = data[inTrainingSet,]
trainset = data[-inTrainingSet,]
set.seed(123)
inTrainingSet = createDataPartition(log(dt$HC01_VC128),p=.5, list = FALSE)
holdout = dt[inTrainingSet,]
testset = dt[-inTrainingSet,]

gbm.trainset<-trainset
gbm.testset<-testset

Check dependent variable distributions in different datasets. We are looking for similar dependent variable distributions among train/test/holdout sets to ensure that we do not have biased sample selection.

##train number of records:
nrow(trainset)
## [1] 4710
summary(log(trainset$HC01_VC128))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.473  11.387  11.726  11.793  12.108  14.362
train_target_dist<-cbind(log(trainset$HC01_VC128),'trainset')
##test number of records:
nrow(testset)
## [1] 2355
summary(log(testset$HC01_VC128))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.10   11.38   11.72   11.79   12.11   14.33
test_target_dist<-cbind(log(testset$HC01_VC128),'testset')
##holdout number of records:
nrow(holdout)
## [1] 2358
summary(log(holdout$HC01_VC128))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.473  11.387  11.726  11.799  12.107  14.365
holdout_target_dist<-cbind(log(holdout$HC01_VC128),'holdout')

target_dist<-as.data.frame(rbind(train_target_dist,test_target_dist,holdout_target_dist))
names(target_dist)<-c('value','source')
target_dist$value<-as.numeric(target_dist$value)
ggplot(target_dist,aes(value, fill=source))+geom_density(alpha=0.5)

Before imputing missing values, we create missing flags and cap outliers first, as missing flags carry meaningful information, and outliers can impact missing value imputation.

Create missing flags

for (i in 1:ncol(trainset)){
  if(colnames(trainset)[i]!='HC01_VC128' & colnames(trainset)[i]!='Target'){
    trainset[,paste0("flag_",colnames(trainset)[i])]<-as.factor((ifelse(is.na(trainset[,i]),1,0)))
  }
}

Cap outliers

Due to time constraints, we use the 5% and 95% quantiles to cap all the attributes. For future study, we could run a chi-square outlier test and treat each attribute separately for outlier capping, as sketched below.
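
As a sketch of that alternative (assuming the 'outliers' package is available; the column used here is only an illustrative example), a per-attribute chi-squared outlier test could look like:

#illustrative only, not part of the current pipeline (assumes the 'outliers' package)
library(outliers)
x<-trainset$HC01_VC03                      #hypothetical example attribute
chisq.out.test(x[!is.na(x)])               #chi-squared test for the most extreme value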

#boxplot shows outliers for dependent variable
ggplot(data=trainset,aes(x=1,y=trainset$HC01_VC128))+geom_boxplot()+xlab('Boxplot')+ylab('')

#transform and cap dependent variable
trainset$Target<-pmin(log(trainset$HC01_VC128),quantile(log(trainset$HC01_VC128),0.99))
trainset<-trainset%>%select(-HC01_VC128)

#cap numeric independent variable
for (i in 1:ncol(trainset)){
if ((substr(colnames(trainset)[i],1,5)=='Weat_'|substr(colnames(trainset)[i],1,5)=='HC01_')&(class(eval(parse(text=paste0('trainset$',colnames(trainset)[i]))))=='numeric'|class(eval(parse(text=paste0('trainset$',colnames(trainset)[i]))))=='integer')){
  trainset[,paste0(colnames(trainset)[i],'_cap')]<-pmin(pmax(eval(parse(text=paste0('trainset$',colnames(trainset)[i]))),quantile(eval(parse(text=paste0('trainset$',colnames(trainset)[i]))),0.05,na.rm=TRUE)),quantile(eval(parse(text=paste0('trainset$',colnames(trainset)[i]))),0.95,na.rm=TRUE))
}
}

#save a capping value table for testset transformation later
trainset.num<-trainset[,sapply(trainset,class)=='numeric'|sapply(trainset,class)=='integer']
capping<-as.matrix(rbind(colnames(trainset.num),sapply(trainset.num,function(x) quantile(x,0.05,na.rm=TRUE)),sapply(trainset.num,function(x) quantile(x,0.95,na.rm=TRUE))))
colnames(capping)<-capping[1,]
capping<-as.data.frame(capping[2:3,])

Impute missing values

Regarding missing values, we treat variables from different datasets differently.

  1. For weather-related variables, climate is highly correlated with location, and areas within the same state share a similar climate. One way to impute the weather data is therefore to use the state mean, and since state information is fully populated in the dataset, this approach is feasible (see the sketch after this list).

  2. For house-related variables, we can try two different approaches: use bivariate plots to go through the relationship between each independent variable and the dependent variable and impute the missing values with a user-defined statistic (mean, min, max, zero, etc.), or use k-nearest neighbors to impute the missing values. Here we use k-nearest neighbors to impute our housing variables.
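
As a minimal sketch of approach 1 (assuming dplyr is loaded and STATE.x identifies the state; the function name and example columns are illustrative only), state-mean imputation can be expressed as:

#illustrative sketch of state-mean imputation, not the exact pipeline code used below
impute_by_state<-function(df,cols,state_col='STATE.x'){
  df%>%
    group_by(.data[[state_col]])%>%
    mutate(across(all_of(cols),~coalesce(.,mean(.,na.rm=TRUE))))%>%
    ungroup()
}
#e.g. trainset<-impute_by_state(trainset,c('Weat_Jan_prcp_cap','Weat_Jul_tavg_cap'))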

Select useful columns

trainset<-as.data.frame(trainset%>%select(Target,ends_with('_cap'),starts_with('flag_'),contains('_Type'),ELEVATION,STATE.x,ALAND,AWATER))

Separate columns by data type

colclasses <- sapply(trainset,class)
numColumns <- which(colclasses=="numeric"|colclasses=="integer")
JustNumbers<-trainset%>%select(STATE.x,numColumns)

Create a state-average lookup table for weather-related attribute imputation

weat_state<-list()
for (i in seq_along(1:(ncol(JustNumbers)))){
  if (i==1){
   weat_state[[i]]<-as.vector((JustNumbers%>%select(STATE.x,colnames(JustNumbers)[i])%>%group_by(STATE.x)%>%summarise(avg=mean(eval(parse(text=colnames(JustNumbers)[i])),na.rm=TRUE)))[,1])  
  } else{
  weat_state[[i]]<-as.vector((JustNumbers%>%select(STATE.x,colnames(JustNumbers)[i])%>%group_by(STATE.x)%>%summarise(avg=mean(eval(parse(text=colnames(JustNumbers)[i])),na.rm=TRUE)))[,2])
  }
}

weat_state_avg<-matrix(nrow=51)
for (i in 1:length(weat_state)){
  weat_state_avg<-cbind(weat_state_avg,(unlist(weat_state[i])))
}
weat_state_avg<-as_tibble((weat_state_avg))[,2:ncol(weat_state_avg)]
weat_state_avg<-cbind(weat_state_avg[,1],sapply(weat_state_avg[,2:ncol(weat_state_avg)],as.numeric))
names(weat_state_avg)=c('STATE.x',paste0('state_avg_',colnames(JustNumbers)[2:ncol(JustNumbers)]))

trainset<-as.data.frame(trainset%>%left_join(weat_state_avg,by='STATE.x'))

Attribute imputation

#Impute house characteristics data first with k nearest neighbor
HC_01<-trainset%>%select(starts_with('HC01_'))
HC_01_Imputed<-knnImputation(HC_01,k=3)
names(HC_01_Imputed)<-paste0('Impute_',colnames(HC_01_Imputed))
  
#Impute weather factor data with the mode and numeric data with the state average
for (i in 1:ncol(trainset)){
  if((class(trainset[,i])=='character'|class(trainset[,i])=='factor')&substr(colnames(trainset)[i],1,5)!='flag_'){   #Impute Weather type data and factors
    trainset[,i][trainset[,i] == 'P'] <- NA
    trainset[,paste0("F_",colnames(trainset)[i])]<-impute(as.factor(trainset[,i]),mode)
  }
  else if((substr(colnames(trainset)[i],1,5)=='Weat_')&(class(trainset[,i])=='numeric'|class(trainset[,i])=='integer')&(substr(colnames(trainset)[i],1,10)!='state_avg_')){ #Impute Weather numeric data
    trainset[,i]<-as.numeric(trainset[,i])
    trainset[,paste0("Impute_",colnames(trainset)[i])]=coalesce(trainset[,i],eval(parse(text=paste0('trainset$state_avg_',colnames(trainset)[i]))))#Impute state average
  }
}

trainset<-cbind(trainset,HC_01_Imputed)
trainset<-as.data.frame(trainset%>%select(Target,starts_with('flag_'),starts_with('Impute_'),starts_with('F_'),ELEVATION, ALAND, AWATER))

Create climate regions

Climate is highly affected by the geographic features of a location. However, how granular our data needs to be with respect to geography is debatable. In this project, we include both state and climate-region information in our model.

trainset$Regions<-as.factor(ifelse(trainset$F_STATE.x%in%c('NT','WY','CO','ND','SD','NE','KS'),'northwest',ifelse(trainset$F_STATE.x%in%c('MN','IA','IL','WI','MO','OH','IN','MI'),'midwest',ifelse(trainset$F_STATE.x%in%c('AZ','NM','TX','OK'),'southwest',ifelse(trainset$F_STATE.x%in%c('WA','OR','NV','ID','UT','CA'),'westcoast',ifelse(trainset$F_STATE.x%in%c('AR','LA','MS','TN','AL','GA','FL','SC','NC','VA','WV','KY'),'south',('northeast')))))))

# trainset<-trainset%>%select(-F_STATE.x,)

3. Exploratory Data Analysis

After feature engineering, we can now look at some of the important weather attributes and examine their relationships with home price.
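
bivar.plot used below is the self-defined bivariate plot function sourced from attr.bivar.R earlier. As a rough sketch of the idea it conveys (assuming it bins the attribute into n.rank quantile groups and plots the mean target per bin; dplyr and ggplot2 assumed loaded, names illustrative):

#rough sketch of a quantile-binned bivariate plot, not the actual attr.bivar.R code
bivar_sketch<-function(df,xvar,yvar,n.rank=50){
  df%>%
    mutate(bin=ntile(.data[[xvar]],n.rank))%>%
    group_by(bin)%>%
    summarise(mean_x=mean(.data[[xvar]],na.rm=TRUE),mean_y=mean(.data[[yvar]],na.rm=TRUE))%>%
    ggplot(aes(x=mean_x,y=mean_y))+geom_point()+geom_line()+labs(x=xvar,y=paste('mean',yvar))
}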

precipitation

#spring months
bivar.plot(trainset,'Impute_Weat_Apr_prcp_cap','Target',n.rank=50)

bivar.plot(trainset,'Impute_Weat_May_prcp_cap','Target',n.rank=50)

#summer months
bivar.plot(trainset,'Impute_Weat_Jun_prcp_cap','Target',n.rank=50)

#fall months
bivar.plot(trainset,'Impute_Weat_Sep_prcp_cap','Target',n.rank=50)

#winter months
bivar.plot(trainset,'Impute_Weat_Jan_prcp_cap','Target',n.rank=50)

bivar.plot(trainset,'Impute_Weat_Feb_prcp_cap','Target',n.rank=50)

From the above charts, we see that areas with higher precipitation have lower home prices in the summer months but higher home prices in the winter months. One interesting observation is that in September, precipitation has a nonlinear relationship with home price.

To deal with this relationship, we can create two variables to represent the different trends in different ranges. Here we use weight of evidence (WOE) to find the cutoff point.

WOE and information value (IV) formulas:

\(WOE_i=\ln\left(\frac{p(\text{non-event})_i}{p(\text{event})_i}\right)\)

\(IV=\sum_{i=1}^{n}\left(\text{Distribution Good}_i-\text{Distribution Bad}_i\right)\times WOE_i\)
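
As a minimal illustration of these formulas for a binned numeric variable against a binary event flag (WOE_numeric_split used below is a self-defined function adapted to the continuous target here; this sketch is only for intuition):

#illustrative WOE/IV computation for a 0/1 event flag, not WOE_numeric_split itself
woe_iv<-function(x,event,n_bins=10){
  brks<-unique(quantile(x,probs=seq(0,1,length.out=n_bins+1),na.rm=TRUE))
  grp<-cut(x,breaks=brks,include.lowest=TRUE)
  tab<-table(grp,event)                          #rows: bins, columns: 0/1
  dist_good<-tab[,'0']/sum(tab[,'0'])            #distribution of non-events per bin
  dist_bad<-tab[,'1']/sum(tab[,'1'])             #distribution of events per bin
  woe<-log(dist_good/dist_bad)
  list(WOE=woe,IV=sum((dist_good-dist_bad)*woe))
}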

WOE_numeric_split(x='Impute_Weat_Sep_prcp_cap',y1='Target',data=trainset,group=10)
## $WOE
## # A tibble: 7 x 14
##     grp n_obs mean_x sum_y1 sum_y0 sum_w mean_y1 mean_y0 LowerBound
##   <int> <int>  <dbl>  <dbl>  <dbl> <dbl>   <dbl>   <dbl>      <dbl>
## 1     1   470   56.2  5737.  5541.   470    12.2    11.8         45
## 2     2   469  116.   5593.  5530.   469    11.9    11.8         89
## 3     3   473  162.   5578.  5577.   473    11.8    11.8        138
## 4     4   928  279.  10735. 10941.   928    11.6    11.8        199
## 5     5   955  348.  11150. 11260.   955    11.7    11.8        324
## 6     6   943  405.  11125. 11118.   943    11.8    11.8        375
## 7     7   472  497.   5613.  5565.   472    11.9    11.8        450
## # ... with 5 more variables: UpperBound <dbl>, woe <dbl>, ks <dbl>,
## #   info <dbl>, plot.width <dbl>
## 
## $Stat
##         MAX_KS     Info.Value Trend.Estimate Trend.Pr(>|t|) 
##   4.698438e-01   2.337355e-04  -5.022986e-05   3.243015e-01 
## 
## $WOE_Code
## [1] "trainset$w_Impute_Weat_Sep_prcp_cap = trainset$Impute_Weat_Sep_prcp_cap"                                                                                            
## [2] "g_Impute_Weat_Sep_prcp_cap = c(45, 89, 138, 199, 324, 375, 450, 518)"                                                                                               
## [3] "trainset$w_Impute_Weat_Sep_prcp_cap[findInterval(trainset$Impute_Weat_Sep_prcp_cap, g_Impute_Weat_Sep_prcp_cap, rightmost.closed=TRUE) == 1] = 0.0347702723323038"  
## [4] "trainset$w_Impute_Weat_Sep_prcp_cap[findInterval(trainset$Impute_Weat_Sep_prcp_cap, g_Impute_Weat_Sep_prcp_cap, rightmost.closed=TRUE) == 2] = 0.0113994413145294"  
## [5] "trainset$w_Impute_Weat_Sep_prcp_cap[findInterval(trainset$Impute_Weat_Sep_prcp_cap, g_Impute_Weat_Sep_prcp_cap, rightmost.closed=TRUE) == 3] = 0.000260558517112001"
## [6] "trainset$w_Impute_Weat_Sep_prcp_cap[findInterval(trainset$Impute_Weat_Sep_prcp_cap, g_Impute_Weat_Sep_prcp_cap, rightmost.closed=TRUE) == 4] = -0.0190688743527721" 
## [7] "trainset$w_Impute_Weat_Sep_prcp_cap[findInterval(trainset$Impute_Weat_Sep_prcp_cap, g_Impute_Weat_Sep_prcp_cap, rightmost.closed=TRUE) == 5] = -0.00979637863295558"
## [8] "trainset$w_Impute_Weat_Sep_prcp_cap[findInterval(trainset$Impute_Weat_Sep_prcp_cap, g_Impute_Weat_Sep_prcp_cap, rightmost.closed=TRUE) == 6] = 0.000641267961459832"
## [9] "trainset$w_Impute_Weat_Sep_prcp_cap[findInterval(trainset$Impute_Weat_Sep_prcp_cap, g_Impute_Weat_Sep_prcp_cap, rightmost.closed=TRUE) == 7] = 0.0086564356766699"  
## 
## $Plot

As seen in the above WOE output, the trend changes from negative to positive around bin 4, and we use the corresponding value of the independent variable, -0.08290098, as our cutoff value.

trainset$Impute_Weat_Sep_prcp_g1_cap<-pmax(-0.08290098-trainset$Impute_Weat_Sep_prcp_cap,0)
trainset$Impute_Weat_Sep_prcp_g2_cap<-pmax(trainset$Impute_Weat_Sep_prcp_cap+0.08290098,0)

trainset<-trainset%>%select(-Impute_Weat_Sep_prcp_cap)

average temperature

#spring months
bivar.plot(trainset,'Impute_Weat_Apr_tavg_cap','Target',n.rank=50)

#summer months
bivar.plot(trainset,'Impute_Weat_Jun_tavg_cap','Target',n.rank=50)

bivar.plot(trainset,'Impute_Weat_Jul_tavg_cap','Target',n.rank=50)

bivar.plot(trainset,'Impute_Weat_Aug_tavg_cap','Target',n.rank=50)

#winter months
bivar.plot(trainset,'Impute_Weat_Jan_tavg_cap','Target',n.rank=50)

From the above charts, we see that home prices are higher where summer temperatures are lower; winter temperature does not have as big an impact on home price as summer temperature.

maximum temperature

#spring months
bivar.plot(trainset,'Impute_Weat_Apr_tmax_cap','Target',n.rank=50)

bivar.plot(trainset,'Impute_Weat_May_tmax_cap','Target',n.rank=50)

#summer months
bivar.plot(trainset,'Impute_Weat_Jun_tmax_cap','Target',n.rank=50)

bivar.plot(trainset,'Impute_Weat_Jul_tmax_cap','Target',n.rank=50)

bivar.plot(trainset,'Impute_Weat_Aug_tmax_cap','Target',n.rank=50)

#winter months
bivar.plot(trainset,'Impute_Weat_Dec_tmax_cap','Target',n.rank=50)

bivar.plot(trainset,'Impute_Weat_Jan_tmax_cap','Target',n.rank=50)

minimum temperature

#spring months
bivar.plot(trainset,'Impute_Weat_May_tmin_cap','Target',n.rank=50)

#summer months
bivar.plot(trainset,'Impute_Weat_Jun_tmin_cap','Target',n.rank=50)

bivar.plot(trainset,'Impute_Weat_Jul_tmin_cap','Target',n.rank=50)

bivar.plot(trainset,'Impute_Weat_Aug_tmin_cap','Target',n.rank=50)

#winter months
bivar.plot(trainset,'Impute_Weat_Dec_tmin_cap','Target',n.rank=50)

bivar.plot(trainset,'Impute_Weat_Jan_tmin_cap','Target',n.rank=50)

regions

trainset%>%group_by(Regions)%>%summarise(avg_home_price=mean((Target)))
## # A tibble: 6 x 2
##   Regions   avg_home_price
##   <fct>              <dbl>
## 1 midwest             11.7
## 2 northeast           12.2
## 3 northwest           11.6
## 4 south               11.6
## 5 southwest           11.5
## 6 westcoast           12.2

Based on the exploratory analysis, we can see that geographic location, house characteristics, and weather data are all predictive. Home price varies depending on which state the house is located in, the house characteristics, and the local temperature and percentage of snowfall.

recording type

The recording type is the same across weather attributes for a given record, so here we only look at the January recording-type attributes as an example.

trainset%>%group_by(F_Weat_Jan_Type_prcp)%>%summarise(avg_home_price=mean((Target)))
## # A tibble: 4 x 2
##   F_Weat_Jan_Type_prcp avg_home_price
##   <fct>                         <dbl>
## 1 C                              11.8
## 2 Q                              11.9
## 3 R                              11.8
## 4 S                              11.7
trainset%>%group_by(F_Weat_Jan_Type_tavg)%>%summarise(avg_home_price=mean((Target)))
## # A tibble: 4 x 2
##   F_Weat_Jan_Type_tavg avg_home_price
##   <fct>                         <dbl>
## 1 C                              11.8
## 2 Q                              11.8
## 3 R                              11.8
## 4 S                              11.8
trainset%>%group_by(F_Weat_Jan_Type_tmax)%>%summarise(avg_home_price=mean((Target)))
## # A tibble: 4 x 2
##   F_Weat_Jan_Type_tmax avg_home_price
##   <fct>                         <dbl>
## 1 C                              11.8
## 2 Q                              11.8
## 3 R                              11.8
## 4 S                              11.8
trainset%>%group_by(F_Weat_Jan_Type_tmin)%>%summarise(avg_home_price=mean((Target)))
## # A tibble: 4 x 2
##   F_Weat_Jan_Type_tmin avg_home_price
##   <fct>                         <dbl>
## 1 C                              11.9
## 2 Q                              11.8
## 3 R                              11.8
## 4 S                              11.8

4. Linear Regression

Standardize numeric attributes

As our weather and house attributes have very different scales, we standardize the numeric attributes before running the models so that the model coefficients are easier to compare.
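
Concretely, each numeric attribute \(x\) is centered and scaled to \(z=\frac{x-\bar{x}}{s_x}\), where \(\bar{x}\) and \(s_x\) are the attribute's mean and standard deviation; this is what the center and scale methods of preProcess below do.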

colclasses <- sapply(trainset,class)
numColumns <- which(colclasses=="numeric"|colclasses=="integer")
JustNumbers<-trainset%>%select(numColumns)

preObj <- preProcess(JustNumbers[, -1], method=c("center", "scale"))
scaled.JustNumbers <- predict(preObj, JustNumbers[, -1])

othrColumns<-trainset%>%select(-numColumns)

trainset<-cbind(trainset$Target,othrColumns,scaled.JustNumbers)
names(trainset)[1]<-'Target'

Before variable reduction, we create dummy variables for all the factor columns first.

trainset<-Filter(function(x)(length(unique(x))>1), trainset)
x <- model.matrix(Target~., trainset)[,-1]
y<-trainset$Target

z<-as.data.frame(x)
colnames(z)<-make.names(names(as.data.frame(z)), unique = FALSE, allow_ = TRUE)
trainset<-cbind(y,as.data.frame(z))
names(trainset)[1]<-'Target'

Variable reduction

Lasso regression

trainset<-Filter(function(x)(length(unique(x))>1), trainset)

set.seed(123)
cv.lasso <- cv.glmnet(x, (y), type.measure='mse',nfolds=5, alpha = 1, family = "gaussian")
cv.lasso$lambda.min
## [1] 0.0006357814
model <- glmnet(x, y, alpha = 1, family = "gaussian",
                lambda = cv.lasso$lambda.min)
print(model)
## 
## Call:  glmnet(x = x, y = y, family = "gaussian", alpha = 1, lambda = cv.lasso$lambda.min) 
## 
##       Df   %Dev    Lambda
## [1,] 205 0.8522 0.0006358
rank<-caret::getModelInfo("glmnet")$glmnet$varImp(model,lambda=cv.lasso$lambda.min)

#check variables with 0 coefficients
c<-coef(cv.lasso,s='lambda.min',exact=TRUE)
inds<-which(c==0)
variables<-row.names(c)[sort(inds,decreasing = FALSE)]
print(variables[1:50])
##  [1] "flag_Weat_Jan_Type_prcp1" "flag_Weat_Feb_prcp1"     
##  [3] "flag_Weat_Feb_Type_prcp1" "flag_Weat_Mar_prcp1"     
##  [5] "flag_Weat_Mar_Type_prcp1" "flag_Weat_Apr_prcp1"     
##  [7] "flag_Weat_Apr_Type_prcp1" "flag_Weat_May_prcp1"     
##  [9] "flag_Weat_May_Type_prcp1" "flag_Weat_Jun_prcp1"     
## [11] "flag_Weat_Jun_Type_prcp1" "flag_Weat_Jul_prcp1"     
## [13] "flag_Weat_Jul_Type_prcp1" "flag_Weat_Aug_prcp1"     
## [15] "flag_Weat_Aug_Type_prcp1" "flag_Weat_Sep_prcp1"     
## [17] "flag_Weat_Sep_Type_prcp1" "flag_Weat_Oct_prcp1"     
## [19] "flag_Weat_Oct_Type_prcp1" "flag_Weat_Nov_prcp1"     
## [21] "flag_Weat_Nov_Type_prcp1" "flag_Weat_Dec_prcp1"     
## [23] "flag_Weat_Dec_Type_prcp1" "flag_Weat_Jan_Type_tavg1"
## [25] "flag_Weat_Feb_tavg1"      "flag_Weat_Feb_Type_tavg1"
## [27] "flag_Weat_Mar_tavg1"      "flag_Weat_Mar_Type_tavg1"
## [29] "flag_Weat_Apr_tavg1"      "flag_Weat_Apr_Type_tavg1"
## [31] "flag_Weat_May_tavg1"      "flag_Weat_May_Type_tavg1"
## [33] "flag_Weat_Jun_tavg1"      "flag_Weat_Jun_Type_tavg1"
## [35] "flag_Weat_Jul_tavg1"      "flag_Weat_Jul_Type_tavg1"
## [37] "flag_Weat_Aug_tavg1"      "flag_Weat_Aug_Type_tavg1"
## [39] "flag_Weat_Sep_tavg1"      "flag_Weat_Sep_Type_tavg1"
## [41] "flag_Weat_Oct_tavg1"      "flag_Weat_Oct_Type_tavg1"
## [43] "flag_Weat_Nov_tavg1"      "flag_Weat_Nov_Type_tavg1"
## [45] "flag_Weat_Dec_tavg1"      "flag_Weat_Dec_Type_tavg1"
## [47] "flag_Weat_Jan_tmax1"      "flag_Weat_Jan_Type_tmax1"
## [49] "flag_Weat_Feb_tmax1"      "flag_Weat_Feb_Type_tmax1"

From the lasso regression, we see that most of the variables with zero coefficients are flag variables. This gives us a general idea of which variables to drop later.
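
The lasso's ability to zero out coefficients comes from its L1 penalty: for the gaussian family with alpha = 1, glmnet minimizes approximately \(\frac{1}{2n}\sum_{i=1}^{n}(y_i-x_i^\top\beta)^2+\lambda\lVert\beta\rVert_1\), and at the selected \(\lambda\) (lambda.min above) the weakest predictors are shrunk exactly to zero.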

Linear regression

Backward Selection
trainset=trainset[!is.na(trainset$Target),]
#drop no variation columns
trainset<-Filter(function(x)(length(unique(x))>1), trainset)

#model_1
base.model<-glm(Target~., data=trainset, family = 'gaussian')

#Get variable list with P-value less than 0.05
varslist<-summary(base.model)$coefficients[(summary(base.model)$coef[, "Pr(>|t|)"])<0.05&!is.na((summary(base.model)$coef[, "Pr(>|t|)"])<0.05),1]

#model_2
trainset2<-trainset[,c(names(varslist)[2:length(names(varslist))])]
trainset2<-cbind(trainset$Target,trainset2)
names(trainset2)[1]<-'Target'
base.model2<-glm(Target~., data=trainset2, family = 'gaussian')

#Get variable list with P-value less than 0.05
varslist2<-summary(base.model2)$coefficients[(summary(base.model2)$coef[, "Pr(>|t|)"])<0.05,1]
names(varslist2)
vif(lm(base.model2, data=trainset2))

#model_3
trainset3<-trainset2[,c(names(varslist2)[2:length(names(varslist2))])]
trainset3<-cbind(trainset$Target,trainset3)
names(trainset3)[1]<-'Target'
base.model3<-glm(Target~., data=trainset3, family = 'gaussian')
vif(lm(base.model3, data=trainset3))
#model_4
#Drop variables with high vif values in base.model3 
trainset4<-trainset3%>%select(-Impute_HC01_VC03_cap,-Impute_Weat_Oct_tmax_cap,-Impute_HC01_VC141_cap,-flag_HC01_VC691)
base.model4<-glm(Target~., data=trainset4, family = 'gaussian')
summary(base.model4)
## 
## Call:
## glm(formula = Target ~ ., family = "gaussian", data = trainset4)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.28188  -0.10381   0.01252   0.11990   1.72925  
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)              11.794244   0.004107 2871.916  < 2e-16 ***
## flag_HC01_VC091           0.249563   0.056295    4.433 9.50e-06 ***
## flag_HC01_VC701           0.179391   0.039556    4.535 5.90e-06 ***
## flag_HC01_VC1461         -0.112936   0.027384   -4.124 3.79e-05 ***
## flag_HC01_VC1551          0.201619   0.031707    6.359 2.23e-10 ***
## F_STATE.xCO               0.096090   0.022293    4.310 1.66e-05 ***
## F_STATE.xHI               0.247879   0.031399    7.894 3.61e-15 ***
## F_STATE.xKS              -0.161327   0.018551   -8.697  < 2e-16 ***
## F_STATE.xMT               0.089192   0.021167    4.214 2.56e-05 ***
## F_STATE.xSD              -0.194366   0.024675   -7.877 4.14e-15 ***
## F_STATE.xTX              -0.201945   0.016089  -12.552  < 2e-16 ***
## F_STATE.xUT               0.122146   0.025860    4.723 2.39e-06 ***
## F_STATE.xVA               0.129047   0.027445    4.702 2.65e-06 ***
## F_STATE.xWY               0.144247   0.028322    5.093 3.66e-07 ***
## Impute_Weat_Jul_prcp_cap -0.028106   0.004596   -6.115 1.04e-09 ***
## Impute_Weat_Jan_tavg_cap  0.106466   0.009468   11.245  < 2e-16 ***
## Impute_Weat_Sep_tavg_cap -0.108954   0.009212  -11.827  < 2e-16 ***
## Impute_HC01_VC05_cap      0.012284   0.006380    1.925 0.054230 .  
## Impute_HC01_VC08_cap      0.010799   0.003707    2.913 0.003598 ** 
## Impute_HC01_VC21_cap      0.031565   0.006703    4.709 2.57e-06 ***
## Impute_HC01_VC42_cap      0.025022   0.006570    3.809 0.000141 ***
## Impute_HC01_VC50_cap      0.014090   0.004979    2.830 0.004673 ** 
## Impute_HC01_VC70_cap     -0.020904   0.003996   -5.231 1.76e-07 ***
## Impute_HC01_VC85_cap     -0.039027   0.007486   -5.213 1.94e-07 ***
## Impute_HC01_VC94_cap      0.021537   0.004491    4.796 1.67e-06 ***
## Impute_HC01_VC120_cap    -0.150008   0.009165  -16.367  < 2e-16 ***
## Impute_HC01_VC121_cap    -0.068646   0.011326   -6.061 1.46e-09 ***
## Impute_HC01_VC122_cap     0.014258   0.012468    1.144 0.252855    
## Impute_HC01_VC123_cap     0.034687   0.012259    2.829 0.004683 ** 
## Impute_HC01_VC125_cap     0.039963   0.012364    3.232 0.001237 ** 
## Impute_HC01_VC126_cap     0.050687   0.011607    4.367 1.29e-05 ***
## Impute_HC01_VC127_cap     0.023375   0.006624    3.529 0.000421 ***
## Impute_HC01_VC140_cap     0.124565   0.014076    8.849  < 2e-16 ***
## Impute_HC01_VC142_cap    -0.090684   0.013074   -6.936 4.58e-12 ***
## Impute_HC01_VC143_cap    -0.050596   0.014079   -3.594 0.000329 ***
## Impute_HC01_VC144_cap    -0.024033   0.012891   -1.864 0.062352 .  
## Impute_HC01_VC145_cap    -0.040346   0.011992   -3.364 0.000773 ***
## Impute_HC01_VC146_cap     0.328105   0.008051   40.753  < 2e-16 ***
## Impute_HC01_VC155_cap     0.062627   0.006626    9.451  < 2e-16 ***
## Impute_HC01_VC170_cap     0.095805   0.011102    8.630  < 2e-16 ***
## Impute_HC01_VC191_cap     0.075179   0.006644   11.315  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.05424897)
## 
##     Null deviance: 1477.93  on 4709  degrees of freedom
## Residual deviance:  253.29  on 4669  degrees of freedom
## AIC: -316.53
## 
## Number of Fisher Scoring iterations: 2
Check multicollinearity with variance inflation factor
vif(lm(base.model4, data=trainset4))
##          flag_HC01_VC091          flag_HC01_VC701         flag_HC01_VC1461 
##                 1.856701                 1.988951                 2.097918 
##         flag_HC01_VC1551              F_STATE.xCO              F_STATE.xHI 
##                 2.043867                 1.106112                 1.306127 
##              F_STATE.xKS              F_STATE.xMT              F_STATE.xSD 
##                 1.109903                 1.160733                 1.077041 
##              F_STATE.xTX              F_STATE.xUT              F_STATE.xVA 
##                 1.566230                 1.171144                 1.051655 
##              F_STATE.xWY Impute_Weat_Jul_prcp_cap Impute_Weat_Jan_tavg_cap 
##                 1.091277                 1.833726                 7.780751 
## Impute_Weat_Sep_tavg_cap     Impute_HC01_VC05_cap     Impute_HC01_VC08_cap 
##                 7.366807                 3.532799                 1.193014 
##     Impute_HC01_VC21_cap     Impute_HC01_VC42_cap     Impute_HC01_VC50_cap 
##                 3.900621                 3.746380                 2.151665 
##     Impute_HC01_VC70_cap     Impute_HC01_VC85_cap     Impute_HC01_VC94_cap 
##                 1.385987                 4.865102                 1.750420 
##    Impute_HC01_VC120_cap    Impute_HC01_VC121_cap    Impute_HC01_VC122_cap 
##                 7.291768                11.135451                13.493400 
##    Impute_HC01_VC123_cap    Impute_HC01_VC125_cap    Impute_HC01_VC126_cap 
##                13.045743                13.270368                11.694692 
##    Impute_HC01_VC127_cap    Impute_HC01_VC140_cap    Impute_HC01_VC142_cap 
##                 3.808437                17.199512                14.837614 
##    Impute_HC01_VC143_cap    Impute_HC01_VC144_cap    Impute_HC01_VC145_cap 
##                17.206585                14.425669                12.483018 
##    Impute_HC01_VC146_cap    Impute_HC01_VC155_cap    Impute_HC01_VC170_cap 
##                 5.626486                 3.811400                10.698985 
##    Impute_HC01_VC191_cap 
##                 3.831942

We do not see extremely high VIF values; therefore, our final linear regression model does not have a serious multicollinearity issue.
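
For reference, the variance inflation factor for predictor \(j\) is \(VIF_j=\frac{1}{1-R_j^2}\), where \(R_j^2\) is the R-squared from regressing predictor \(j\) on all the other predictors, so larger values indicate that a predictor is more strongly explained by the others.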

Getting ready for performance testing

Testset transformation - the same transformations as trainset

*For explanations of the transformations, refer to the trainset feature engineering section above.

for (i in 1:ncol(testset)){
  if(colnames(testset)[i]!='HC01_VC128' & colnames(testset)[i]!='Target'){
    testset[,paste0("flag_",colnames(testset)[i])]<-as.factor((ifelse(is.na(testset[,i]),1,0)))
  }
}

testset<-as.data.frame(testset%>%select(-HC01_VC128))

#capping variables using trainset capping values
for (i in 1:ncol(testset)){
  if ((substr(colnames(testset)[i],1,5)=='Weat_'|substr(colnames(testset)[i],1,5)=='HC01_')&(class(eval(parse(text=paste0('testset$',colnames(testset)[i]))))=='numeric'|class(eval(parse(text=paste0('testset$',colnames(testset)[i]))))=='integer')){
    testset[,paste0(colnames(testset)[i],'_cap')]<-pmin(pmax(testset[,i],capping[1,colnames(testset)[i]]),capping[2,colnames(testset)[i]])
  }
}

testset<-as.data.frame(testset%>%select(Target,ends_with('_cap'),starts_with('flag_'),contains('_Type'),STATE.x,ELEVATION))

testset<-as.data.frame(testset%>%left_join(weat_state_avg,by='STATE.x'))

#Impute house characteristics data first with k nearest neighbor
HC_01<-testset%>%select(starts_with('HC01_'))
HC_01_Imputed<-knnImputation(HC_01,k=3)
names(HC_01_Imputed)<-paste0('Impute_',colnames(HC_01_Imputed))

#Impute weather data with the mode and the state average
for (i in 1:ncol(testset)){
  if((class(testset[,i])=='character'|class(testset[,i])=='factor')&substr(colnames(testset)[i],1,5)!='flag_'){   #Impute Weather type data and factors
    testset[,paste0("F_",colnames(testset)[i])]<-impute(as.factor(testset[,i]),mode)
  }
  else if((substr(colnames(testset)[i],1,5)=='Weat_')&(class(testset[,i])=='numeric'|class(testset[,i])=='integer')&(substr(colnames(testset)[i],1,10)!='state_avg_')){ #Impute Weather numeric data
    testset[,i]<-as.numeric(testset[,i])
    testset[,paste0("Impute_",colnames(testset)[i])]=coalesce(testset[,i],eval(parse(text=paste0('testset$state_avg_',colnames(testset)[i]))))#Impute state average
  }
}

testset<-cbind(testset,HC_01_Imputed)
testset<-as.data.frame(testset%>%select(Target,starts_with('flag_'),starts_with('Impute_'),starts_with('F_'),ELEVATION))

testset$Regions<-as.factor(ifelse(testset$F_STATE.x%in%c('NT','WY','CO','ND','SD','NE','KS'),'northwest',ifelse(testset$F_STATE.x%in%c('MN','IA','IL','WI','MO','OH','IN','MI'),'midwest',ifelse(testset$F_STATE.x%in%c('AZ','NM','TX','OK'),'southwest',ifelse(testset$F_STATE.x%in%c('WA','OR','NV','ID','UT','CA'),'westcoast',ifelse(testset$F_STATE.x%in%c('AR','LA','MS','TN','AL','GA','FL','SC','NC','VA','WV','KY'),'south',('northeast')))))))

# testset<-testset%>%select(-F_STATE.x)
testset$Impute_Weat_Sep_prcp_g1_cap<-pmax(-0.08290098-testset$Impute_Weat_Sep_prcp_cap,0)
testset$Impute_Weat_Sep_prcp_g2_cap<-pmax(testset$Impute_Weat_Sep_prcp_cap+0.08290098,0)

testset<-testset%>%select(-Impute_Weat_Sep_prcp_cap)

colclasses <- sapply(testset,class)
numColumns <- which(colclasses=="numeric"|colclasses=="integer")
JustNumbers<-testset%>%select(numColumns)

preObj <- preProcess(JustNumbers[, -1], method=c("center", "scale"))
scaled.JustNumbers <- predict(preObj, JustNumbers[, -1])

OthrNumbers<-testset%>%select(-numColumns)

set<-cbind(testset$Target,OthrNumbers,scaled.JustNumbers)
names(set)[1]<-'Target'

testset<-Filter(function(x)(length(unique(x))>1), testset)
# testset$Target<-log(testset$Target)

x <- model.matrix(Target~., testset)[,-1]
y<-testset$Target

x<-as.data.frame(x)
colnames(x)<-make.names(names(as.data.frame(x)), unique = FALSE, allow_ = TRUE)
testset<-cbind(y,as.data.frame(x))
names(testset)[1]<-'Target'
Predict home price on testset
testset$glm_pred<-round(predict(base.model4,newdata=testset))
Rescale prediction to the dependent variable mean

As we standardized our independent variables to mean=0, we need to rescale our prediction back to the actual dependent variable mean level for the next step.

testset$glm_pred_rescaled<-(testset$glm_pred)*(mean(exp(testset$Target))/mean(testset$glm_pred))
Plot Gains Chart

*GAINS.CHART is a self-defined function. It uses quantiles of the predicted dependent variable as the x-axis and plots the actual dependent variable on the y-axis. A good gains chart shows a clear upward trend, a good KS value, and a large percentage gain between the highest and lowest groups.
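
As a minimal sketch of the idea behind such a gains table (not the actual GAINS.CHART implementation; dplyr assumed loaded, and the function name is illustrative):

#illustrative gains table: rank records by the prediction into n.rank groups and
#compare actual vs. predicted means per group (not the self-defined GAINS.CHART)
gains_sketch<-function(actual,pred,n.rank=20){
  data.frame(actual=actual,pred=pred)%>%
    mutate(Rank=ntile(pred,n.rank))%>%
    group_by(Rank)%>%
    summarise(Actual_Sum=sum(actual),Actual_Mean=mean(actual),
              Pred_Sum=sum(pred),Pred_Mean=mean(pred),Count=n())
}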

GAINS.CHART((exp(testset$Target)),(testset$glm_pred_rescaled),n.rank=20)

## $gainschart
##    Rank Actual_Sum Actual_Mean Pred_Sum Pred_Mean Count
## 1     1    7845200    65926.05  9025484  75844.40   119
## 2     2    8409900    71879.49 11009493  94098.23   117
## 3     3    9184900    77838.14 11952935 101296.06   118
## 4     4   10491600    85996.72 13119060 107533.28   122
## 5     5   10599900    93804.42 12762194 112939.77   113
## 6     6   11522100    95223.97 14223204 117547.14   121
## 7     7   12095600   105179.13 14134706 122910.49   115
## 8     8   13959100   115364.46 15574630 128715.95   121
## 9     9   13865000   118504.27 15754182 134651.13   117
## 10   10   15564000   126536.59 17238195 140147.93   123
## 11   11   14659800   133270.91 16065361 146048.74   110
## 12   12   16702800   141549.15 18075568 153182.78   118
## 13   13   18288600   153685.71 19249041 161756.64   119
## 14   14   19780200   169061.54 19985776 170818.60   117
## 15   15   20999100   181026.72 20935289 180476.63   116
## 16   16   22843200   193586.44 22540707 191022.94   118
## 17   17   25399000   209909.09 24807141 205017.70   121
## 18   18   26717600   234364.91 25620873 224744.50   114
## 19   19   33790400   283952.94 30497190 256278.91   119
## 20   20   60595500   517910.26 40742472 348226.25   117
## 
## $ks
## [1] 21.29
## 
## $gini
## [1] 58.76

Lorenz Curve, Train vs Test

The Lorenz curve is commonly used to check for model overfitting.
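
OrderedCDF used below is also a self-defined function. As a minimal sketch of the Lorenz-curve points it is expected to produce (assuming unit weights; names are illustrative):

#illustrative Lorenz-curve points, not the self-defined OrderedCDF (unit weights assumed):
#sort records by model score, then accumulate the share of records and the share of actual loss
lorenz_sketch<-function(Score,Loss){
  ord<-order(Score)
  data.frame(WeightPts=seq_along(ord)/length(ord),LossPts=cumsum(Loss[ord])/sum(Loss[ord]))
}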

#predict and rescale on trainset
trainset4$glm_pred<-round(predict(base.model4,newdata=trainset4))
trainset4$glm_pred_rescaled<-(trainset4$glm_pred)*(mean(exp(trainset4$Target))/mean(trainset4$glm_pred))

##Lorenz Curve
trainCheck<- OrderedCDF(Score = trainset4$glm_pred, Loss = exp(trainset4$Target), Weight =rep(1,nrow(trainset4)), NBins = 1000)
ValidationCheck<- OrderedCDF(Score = testset$glm_pred, Loss = exp(testset$Target), Weight =rep(1,nrow(testset)), NBins = 1000)

ValidationWeight <- ValidationCheck$WeightPts
ValidationLoss <- ValidationCheck$LossPts
TrainWeight <- trainCheck$WeightPts
TrainLoss <- trainCheck$LossPts

ModelComparisonData <- data.frame(ValidationWeight, ValidationLoss, TrainWeight, TrainLoss)
ModelComparisonData %>% ggvis(~ValidationWeight, ~ValidationLoss) %>% layer_paths(stroke := "blue") %>% 
  layer_paths(data = ModelComparisonData, x = ~ValidationWeight, y = ~ValidationWeight, stroke := "black") %>%
  layer_paths(data = ModelComparisonData, x = ~TrainWeight, y = ~TrainLoss, stroke := "red")

The above chart shows the Lorenz curves for the trainset (red line) and the testset (blue line). The further a curve is from the diagonal line, the better the predictive power. From the chart, testset performance is even slightly better than trainset performance; therefore, the linear regression model shows no overfitting problem.

5. Gradient Boosting

Trainset Variable Transformation for gradient boosting

#Train set
#Create missing flags
for (i in 1:ncol(gbm.trainset)){
  if(colnames(gbm.trainset)[i]!='HC01_VC128' & colnames(gbm.trainset)[i]!='Target'){
    gbm.trainset[,paste0("flag_",colnames(gbm.trainset)[i])]<-as.factor((ifelse(is.na(gbm.trainset[,i]),1,0)))
  }
}

#boxplot shows outliers for dependent variable
ggplot(data=gbm.trainset,aes(x=1,y=gbm.trainset$HC01_VC128))+geom_boxplot()+xlab('Boxplot')+ylab('')

#transform and cap dependent variable (note: the second assignment below overrides the cap, so the gbm target is the uncapped home value)
gbm.trainset$Target<-pmin(gbm.trainset$HC01_VC128,quantile(gbm.trainset$HC01_VC128,0.99))
gbm.trainset$Target<-gbm.trainset$HC01_VC128
gbm.trainset<-as.data.frame(gbm.trainset%>%select(-HC01_VC128))

#Remove independent variable outliers
for (i in 1:ncol(gbm.trainset)){
if ((substr(colnames(gbm.trainset)[i],1,5)=='Weat_'|substr(colnames(gbm.trainset)[i],1,5)=='HC01_')&(class(eval(parse(text=paste0('gbm.trainset$',colnames(gbm.trainset)[i]))))=='numeric'|class(eval(parse(text=paste0('gbm.trainset$',colnames(gbm.trainset)[i]))))=='integer')){
  gbm.trainset[,i]<-pmin(pmax(eval(parse(text=paste0('gbm.trainset$',colnames(gbm.trainset)[i]))),quantile(eval(parse(text=paste0('gbm.trainset$',colnames(gbm.trainset)[i]))),0.05,na.rm=TRUE)),quantile(eval(parse(text=paste0('gbm.trainset$',colnames(gbm.trainset)[i]))),0.95,na.rm=TRUE))
}
}

#save a capping value table for gbm.testset transformation later
gbm.trainset.num<-gbm.trainset[,sapply(gbm.trainset,class)=='numeric'|sapply(gbm.trainset,class)=='integer']
capping.gbm<-as.matrix(rbind(colnames(gbm.trainset.num),sapply(gbm.trainset.num,function(x) quantile(x,0.05,na.rm=TRUE)),sapply(gbm.trainset.num,function(x) quantile(x,0.95,na.rm=TRUE))))
colnames(capping.gbm)<-capping.gbm[1,]
capping.gbm<-as.data.frame(capping.gbm[2:3,])

#Convert character variables to factors
for (i in 1:ncol(gbm.trainset)){
  if(class(gbm.trainset[,i])=="character"){   
    gbm.trainset[,i]<-as.factor(gbm.trainset[,i])
  } else {
    gbm.trainset[,i]=gbm.trainset[,i]
  }
}

#select useful columns 
gbm.trainset<-as.data.frame(gbm.trainset%>%select(Target,starts_with('Weat_'),starts_with('flag_'),starts_with('HC01_'),STATE.x,ELEVATION))

Testset Variable Transformation for gradient boosting

#Create missing flags
for (i in 1:ncol(gbm.testset)){
  if(colnames(gbm.testset)[i]!='HC01_VC128' & colnames(gbm.testset)[i]!='Target'){
    gbm.testset[,paste0("flag_",colnames(gbm.testset)[i])]<-as.factor((ifelse(is.na(gbm.testset[,i]),1,0)))
  }
}

#Create Target variable
gbm.testset$Target<-gbm.testset$HC01_VC128
gbm.testset<-as.data.frame(gbm.testset%>%select(-HC01_VC128))

#capping variables using gbm.trainset capping values
for (i in 1:ncol(gbm.testset)){
  if ((substr(colnames(gbm.testset)[i],1,5)=='Weat_'|substr(colnames(gbm.testset)[i],1,5)=='HC01_')&(class(eval(parse(text=paste0('gbm.testset$',colnames(gbm.testset)[i]))))=='numeric'|class(eval(parse(text=paste0('gbm.testset$',colnames(gbm.testset)[i]))))=='integer')){
    gbm.testset[,paste0(colnames(gbm.testset)[i])]<-pmin(pmax(gbm.testset[,i],capping.gbm[1,colnames(gbm.testset)[i]]),capping.gbm[2,colnames(gbm.testset)[i]])
  }
}

#Convert character variables to factors
for (i in 1:ncol(gbm.testset)){
  if((class(gbm.testset[,i])=='character')){   
    gbm.testset[,i]<-as.factor(gbm.testset[,i])
  }
}

#select useful columns 
gbm.testset<-as.data.frame(gbm.testset%>%select(Target,starts_with('Weat_'),starts_with('flag_'),starts_with('HC01_'),STATE.x,ELEVATION))

Run gradient boosting model

As parameter tuning is time-consuming, the cross-validation parameter tuning code is documented below, but it was run separately from this R Markdown file.

# ctrl <- trainControl(method = "repeatedcv",
#                      number = 5,
#                      # repeats = 5,
#                      summaryFunction=defaultSummary)
# 
# 
# set.seed(5627)
# 
# gbmGrid <-  expand.grid(interaction.depth = c(2,3),
#                         n.trees = (1:3)*500, 
#                         shrinkage = c(0.1),
#                         n.minobsinnode = 100) 
# 
# gbm_fit <- train(Target~.,
#                  data = trainset,
#                  method = "gbm",
#                  bag.fraction = 0.5,
#                  verbose = FALSE,
#                  metric = "RMSE",
#                  trControl = ctrl,
#                  tuneGrid =gbmGrid
# )
# 
# plot(gbm_fit)
# summary(gbm_fit)
# gbm_fit$bestTune

#run gbm with tuned parameters##############
set.seed(12345)
gbm_tuned <- gbm(Target~.,
                 distribution = "gaussian",
                 data = gbm.trainset,
                 n.trees = 1500,# convergence issue, just need enough to converge
                 shrinkage = 0.1,  # convergence issue, lower is better but takes longer.
                 interaction.depth = 3, # effects tree complexity
                 n.minobsinnode = 200, # tree argument 1082/bag.fraction/train.fraction
                 bag.fraction = 0.5, #optimization argument
                 train.fraction = 1.0,
                 cv.folds=5, #gives condition on how good a model is doing
                 keep.data = TRUE,
                 verbose = TRUE, #"CV",
                 class.stratify.cv=NULL
)
best.iter <- gbm.perf(gbm_tuned, method="cv")

Gains Chart

#Predict on trainset and testset
gbm.trainset$gbm_pred<- predict.gbm(gbm_tuned, n.trees = best.iter,newdata = gbm.trainset)
gbm.testset$gbm_pred <- predict.gbm(gbm_tuned, n.trees = best.iter,newdata = gbm.testset)

#Gains Chart on testset
GAINS.CHART(gbm.testset$Target,gbm.testset$gbm_pred,n.rank=20)

## $gainschart
##    Rank Actual_Sum Actual_Mean Pred_Sum Pred_Mean Count
## 1     1    7132200    60442.37  6211418  52639.14   118
## 2     2    8247200    69891.53  8043978  68169.30   118
## 3     3    8929800    75676.27  8930668  75683.62   118
## 4     4    9549900    81623.08  9617977  82204.93   117
## 5     5   10175500    86233.05 10377220  87942.55   118
## 6     6   11151800    94506.78 10971049  92974.99   118
## 7     7   11485900    98170.09 11543818  98665.11   117
## 8     8   12691500   107555.08 12440965 105431.91   118
## 9     9   13120100   111187.29 13256961 112347.13   118
## 10   10   14265800   120896.61 14213748 120455.49   118
## 11   11   14716500   125782.05 15053535 128662.69   117
## 12   12   16583000   140533.90 16204647 137327.52   118
## 13   13   17664000   149694.92 17454559 147919.99   118
## 14   14   18936600   161851.28 18823928 160888.27   117
## 15   15   20793000   176211.86 20667931 175151.95   118
## 16   16   22572300   191290.68 22792007 193152.61   118
## 17   17   25909800   221451.28 25666247 219369.63   117
## 18   18   29003300   245790.68 29702521 251716.28   118
## 19   19   38269500   324317.80 37794034 320288.42   118
## 20   20   62115800   526405.08 57588786 488040.56   118
## 
## $ks
## [1] 23.29
## 
## $gini
## [1] 63.75

Lorenz Curve, Train vs Test

##Lorenz Curve
trainCheck<- OrderedCDF(Score = gbm.trainset$gbm_pred, Loss = (gbm.trainset$Target), Weight =rep(1,nrow(gbm.trainset)), NBins = 1000)
ValidationCheck<- OrderedCDF(Score = gbm.testset$gbm_pred, Loss = (gbm.testset$Target), Weight =rep(1,nrow(gbm.testset)), NBins = 1000)

ValidationWeight <- ValidationCheck$WeightPts
ValidationLoss <- ValidationCheck$LossPts
TrainWeight <- trainCheck$WeightPts
TrainLoss <- trainCheck$LossPts

ModelComparisonData <- data.frame(ValidationWeight, ValidationLoss, TrainWeight, TrainLoss)
ModelComparisonData %>% ggvis(~ValidationWeight, ~ValidationLoss) %>% layer_paths(stroke := "blue") %>% 
  layer_paths(data = ModelComparisonData, x = ~ValidationWeight, y = ~ValidationWeight, stroke := "black") %>%
  layer_paths(data = ModelComparisonData, x = ~TrainWeight, y = ~TrainLoss, stroke := "red")

6. Conclusion

Based on the analysis, we see a significant relationship between weather-related attributes and home prices. We can make some preliminary conclusions:

  1. Home price is higher where precipitation is lower, which is intuitive, as high summer precipitation may cause flooding. People prefer cooler weather in summer, which drives housing demand and increases home prices; this is consistent with our result that home price is higher where summer temperatures are lower. Based on the analysis, winter temperature does not have as big an impact on home price as summer temperature.

  2. Geographic location has a big impact on home price, which is also intuitive. This project used 'State' to model the impact of geographic location on home price. For future improvement, regional factors can be considered, as the U.S. is typically grouped into climate regions based on similar weather patterns.

  3. Comparing the different models, the gradient boosting model is our chosen model for this project, as it performed better than linear regression, with a better gains chart and a higher KS score. Gradient boosting was able to capture non-linear relationships between the independent variables and the dependent variable, such as the one for precipitation. Gradient boosting also does not require as much feature engineering as linear regression, which is ideal for a time-constrained project. Binning and smoothing can help address some non-linear relationships in linear regression; however, this project only looked into weather-related variables for non-linear transformations, which can be improved in future study.

7. Next Steps

  1. Segmentation based on geographic regions

  2. Linear regression binning and smoothing

  3. Missing value imputation with multiple imputation

  4. Test the results on Holdout dataset.