Home price is a popular and well discussed topic, not only among home buyers, but also among investors, real-estate agencies, online listing platforms, etc. A traditional way to predict home prices is by analyzing house characteritics, such as location, square feet, building year, building materials, etc. However, with the increasing data availabilities, more complex and accurate models utilizing unconventional datasets or modeling techniques are emerging. This project aims to explore the relationship between home prices and various weather observations to better predict home prices.
path<-'/Users/xucao/Google Drive/MyWorkStation/Projects/ODG-Weather-HousingPrice/'
dataloc<-paste0(path,'DATA/')
codeloc<-paste0(path,'CODE/')
output <-paste0(path,'OUTPUT/')
source(paste0(codeloc,"functions.r"))
source(paste0(codeloc,"attr.bivar.R")) #self defined bivariate plot function
housing_data<-fread(paste0(dataloc,'housing_data_ACS_15_5YR_DP04_with_ann.csv'),header=TRUE)
#rename zip code column
colnames(housing_data)[2]<-'ZIP'
#convert zip code column to numeric
housing_data$ZIP=as.numeric((housing_data$ZIP))
#drop the top row with variables descriptions
housing_data<-housing_data[2:nrow(housing_data),]
Gaz_zcta_national<-fread(paste0(dataloc,'2015_Gaz_zcta_national.txt'),header=TRUE)
colnames(Gaz_zcta_national)[1]<-'ZIP'
#convert integer64 columns to numeric
Gaz_zcta_national$ALAND<-as.numeric(Gaz_zcta_national$ALAND)
Gaz_zcta_national$AWATER<-as.numeric(Gaz_zcta_national$AWATER)
weather_allstations<-read_fwf(file=paste0(dataloc,'weather_allstations.txt'),fwf_widths(c(12,9,10,7,3,31,4,4,6)))
#renaming columns
names(weather_allstations)<-c('STATION_ID','LAT','LONG','ELEVATION','STATE','STATION_NAME','SOURCE1','SOURCE2','ZIP_STATION')
#subset useful variables
weather_allstations<-weather_allstations[,c('STATION_ID','ELEVATION','STATE')]
weather_zipcodes_stations<-read_fwf(file=paste0(dataloc,'weather_zipcodes-normals-stations.txt'),fwf_widths(c(12,6,50)))
#Renaming columns
names(weather_zipcodes_stations)<-c('STATION_ID','ZIP','CITY')
#Formatting column 'CITY' and 'ZIP', all 'ZIP' columns are converted to numeric for matching purposes
weather_zipcodes_stations$CITY<-toupper(weather_zipcodes_stations$CITY)
weather_zipcodes_stations$ZIP<-as.numeric(weather_zipcodes_stations$ZIP)
#Create primary key
weather_zipcodes_stations$STATION<-paste0(formatC(weather_zipcodes_stations$ZIP, width=5, flag="0"),toupper(weather_zipcodes_stations$CITY))
namelist<-c('STATION_ID','Weat_Jan','Weat_Jan_Type','Weat_Feb','Weat_Feb_Type','Weat_Mar','Weat_Mar_Type','Weat_Apr','Weat_Apr_Type','Weat_May','Weat_May_Type',
'Weat_Jun','Weat_Jun_Type','Weat_Jul','Weat_Jul_Type','Weat_Aug','Weat_Aug_Type','Weat_Sep','Weat_Sep_Type','Weat_Oct','Weat_Oct_Type','Weat_Nov','Weat_Nov_Type',
'Weat_Dec','Weat_Dec_Type')
weather_prcp<-read_fwf(file=paste0(dataloc,'weather_mly-prcp-normal.txt'),fwf_widths(c(17,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1)))
weather_tavg<-read_fwf(file=paste0(dataloc,'weather_mly-tavg-normal.txt'),fwf_widths(c(17,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1)))
weather_tmax<-read_fwf(file=paste0(dataloc,'weather_mly-tmax-normal.txt'),fwf_widths(c(17,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1)))
weather_tmin<-read_fwf(file=paste0(dataloc,'weather_mly-tmin-normal.txt'),fwf_widths(c(17,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1,6,1)))
names(weather_prcp)<-paste0(namelist,'_prcp')
names(weather_prcp)[1]<-'STATION_ID'
names(weather_tavg)<-paste0(namelist,'_tavg')
names(weather_tavg)[1]<-'STATION_ID'
names(weather_tmax)<-paste0(namelist,'_tmax')
names(weather_tmax)[1]<-'STATION_ID'
names(weather_tmin)<-paste0(namelist,'_tmin')
names(weather_tmin)[1]<-'STATION_ID'
#precipitation cannot be negative, so we convert negative values to NA
for (i in 1:ncol(weather_prcp)){
if (class(weather_prcp[,i])=='numeric'){
weather_prcp[,i]<-ifelse(weather_prcp[,i]<0,'NA',weather_prcp[,i])
}
}
clpcdy15<-separate(read_fwf(file=paste0(dataloc,'clpcdy15.txt'),fwf_widths(c(38,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4)),skip=2),col='X1',into=c('STATION','STATE'),sep=',' )
names(clpcdy15)=c('STATION','STATE','YRS','Jan_CL','Jan_PC','Jan_CD','Feb_CL','Feb_PC','Feb_CD','Mar_CL','Mar_PC','Mar_CD','Apr_CL','Apr_PC','Apr_CD','May_CL','May_PC','May_CD','Jun_CL','Jun_PC','Jun_CD','Jul_CL','Jul_PC','Jul_CD','Aug_CL','Aug_PC','Aug_CD','Sep_CL','Sep_PC','Sep_CD','Oct_CL','Oct_PC','Oct_CD','Nov_CL','Nov_PC','Nov_CD','Dec_CL','Dec_PC','Dec_CD','Ann_CL','Ann_PC','Ann_CD')
#Base on Master Location Identifier Database downloaded from 'http://www.weathergraphics.com/identifiers/', WBAN is the weather station identifiersm. The first column of dataset clpcdy15 and pctpos15 consists of WBAN and CITY name.
clpcdy15$WBAN<-as.numeric(substr(clpcdy15$STATION,1,5))
MLID<-fread(file=paste0(dataloc,'MLID.csv'),header=TRUE)
MLID<-MLID%>%select(WBAN, (CITY))
MLID$CITY<-toupper(MLID$CITY)
clpcdy15<-clpcdy15%>%left_join(MLID,by='WBAN')
clpcdy15$CITY<-coalesce(clpcdy15$CITY,substr(clpcdy15$STATION, 6, 50))
clpcdy15$STATION=paste0(clpcdy15$WBAN,clpcdy15$CITY)
pctpos15<-separate(read_fwf(file=paste0(dataloc,'pctpos15.txt'),fwf_widths(c(37,16,6,6,6,6,6,6,6,6,6,6,6,6,6)),skip=2),col='X1',into=c('STATION','STATE'),sep=',' )
names(pctpos15)=c('STATION','STATE','POR','JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC','ANN')
pctpos15$WBAN<-as.numeric(substr(pctpos15$STATION,1,5))
pctpos15$CITY<-substr(pctpos15$STATION, 6, 50)
sapply(housing_data, class)
#We only keep the actual values (HC01), and drop other statistics columns, such as margin of error, percentage, etc.
housing_data<-housing_data%>%select(ZIP,starts_with('HC01_'))
#convert to numerica data
housing=as.data.frame(sapply(housing_data, as.numeric))
sapply(Gaz_zcta_national, class)
sapply(weather_allstations, class)
weather_allstations$ELEVATION<-as.numeric(weather_allstations$ELEVATION)
sapply(weather_zipcodes_stations, class)
sapply(weather_prcp, class)
sapply(weather_tavg, class)
#convert character tempurature data into numeric data
weather_tavg$Weat_Jul_tavg<-as.numeric(weather_tavg$Weat_Jul_tavg)
weather_tavg$Weat_Aug_tavg<-as.numeric(weather_tavg$Weat_Aug_tavg)
sapply(weather_tmax, class)
#convert character tempurature data into numeric data
weather_tmax$Weat_May_tmax<-as.numeric(weather_tmax$Weat_May_tmax)
weather_tmax$Weat_Jun_tmax<-as.numeric(weather_tmax$Weat_Jun_tmax)
weather_tmax$Weat_Jul_tmax<-as.numeric(weather_tmax$Weat_Jul_tmax)
weather_tmax$Weat_Aug_tmax<-as.numeric(weather_tmax$Weat_Aug_tmax)
weather_tmax$Weat_Sep_tmax<-as.numeric(weather_tmax$Weat_Sep_tmax)
sapply(weather_tmin, class)
sapply(clpcdy15, class)
clpcdy15<-cbind(clpcdy15[1:3], lapply(clpcdy15[4:42], as.numeric) ,clpcdy15[43:44])
sapply(pctpos15, class)
clpcdy15$STATION[duplicated(clpcdy15$STATION)]
#dedup on primary key
clpcdy15<-clpcdy15[!duplicated(clpcdy15$STATION),]
pctpos15$STATION[duplicated(pctpos15$STATION)]
pctpos15<-pctpos15[!duplicated(pctpos15$STATION),]
housing$ZIP[duplicated(housing$ZIP)]
Gaz_zcta_national$ZIP[duplicated(Gaz_zcta_national$ZIP)]
weather_allstations$STATION_ID[duplicated(weather_allstations$STATION_ID)]
weather_zipcodes_stations$STATION_ID[duplicated(weather_zipcodes_stations$STATION_ID)]
weather_prcp$STATION_ID[duplicated(weather_prcp$STATION_ID)]
weather_tavg$STATION_ID[duplicated(weather_tavg$STATION_ID)]
weather_tmax$STATION_ID[duplicated(weather_tmax$STATION_ID)]
weather_tmin$STATION_ID[duplicated(weather_tmin$STATION_ID)]
data<-weather_zipcodes_stations%>%left_join(weather_allstations,by='STATION_ID')%>%left_join(weather_prcp,by='STATION_ID')%>%left_join(weather_tavg,by='STATION_ID')%>%left_join(weather_tmax,by='STATION_ID')%>%left_join(weather_tmin,by='STATION_ID')%>%left_join(housing,by='ZIP')%>%left_join(Gaz_zcta_national,by='ZIP')%>%left_join(pctpos15,by='STATION')%>%left_join(clpcdy15,by='STATION')
summary(data$HC01_VC128)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 13000 88200 123700 159727 181300 1732700 371
#drop records with NA target, these records without target informaiton are not useful to our supervised learning analysis.
data<-data[!is.na(data$HC01_VC128),]
#simulate gamma distribution
theta=(sd(log(data$HC01_VC128)))^2/mean(log(data$HC01_VC128))
k=mean(log(data$HC01_VC128))/theta
gamma<-as.data.frame(rgamma(nrow(data),shape=k,scale=theta))
gamma$source<-'gamma'
names(gamma)=c('value','source')
#simulate normal distribution
mean<-mean(log(data$HC01_VC128))
sd<-sd(log(data$HC01_VC128))
norm<-as.data.frame(rnorm(nrow(data),mean=mean,sd=sd))
norm$source='norm'
names(norm)=c('value','source')
#plot actual dependent variable distribution
actual<-as.data.frame(log(data$HC01_VC128))
actual$source='actual'
names(actual)=c('value','source')
dist<-rbind(gamma,norm,actual)
dist$value<-as.numeric(dist$value)
ggplot(dist,aes(value, fill=source))+geom_density(alpha=0.5)
#based on the above analysis, by comparing the distribution shape, a log normal distribution will be used for the target.
data$Target<-log(data$HC01_VC128)
This step is to prepare data for future modeling practice, and we only conduct featue engineering and exploratory analysis on the training set.
set.seed(1)
inTrainingSet = createDataPartition(log(data$HC01_VC128),p=.5, list = FALSE)
dt = data[inTrainingSet,]
trainset = data[-inTrainingSet,]
set.seed(123)
inTrainingSet = createDataPartition(log(dt$HC01_VC128),p=.5, list = FALSE)
holdout = dt[inTrainingSet,]
testset = dt[-inTrainingSet,]
gbm.trainset<-trainset
gbm.testset<-testset
Check dependent variable distributions in different datasets. We are looking for similar dependent variable distributions among train/test/holdout sets to ensure that we do not have biased sample selection.
##train number of records:
nrow(trainset)
## [1] 4710
summary(log(trainset$HC01_VC128))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.473 11.387 11.726 11.793 12.108 14.362
train_target_dist<-cbind(log(trainset$HC01_VC128),'trainset')
##test number of records:
nrow(testset)
## [1] 2355
summary(log(testset$HC01_VC128))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.10 11.38 11.72 11.79 12.11 14.33
test_target_dist<-cbind(log(testset$HC01_VC128),'testset')
##holdout number of records:
nrow(holdout)
## [1] 2358
summary(log(holdout$HC01_VC128))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.473 11.387 11.726 11.799 12.107 14.365
holdout_target_dist<-cbind(log(holdout$HC01_VC128),'holdout')
target_dist<-as.data.frame(rbind(train_target_dist,test_target_dist,holdout_target_dist))
names(target_dist)<-c('value','source')
target_dist$value<-as.numeric(target_dist$value)
ggplot(target_dist,aes(value, fill=source))+geom_density(alpha=0.5)
Before imputing missing values, we create missing flags and cap outliers first, as missing flags carry meaningful information, and outliers can impact missing value imputation.
for (i in 1:ncol(trainset)){
if(colnames(trainset)[i]!='HC01_VC128' & colnames(trainset)[i]!='Target'){
trainset[,paste0("flag_",colnames(trainset)[i])]<-as.factor((ifelse(is.na(trainset[,i]),1,0)))
}
}
Due to the constrained time, here we use 5% and 95% quantiles to cap all the attributes. For future study, we can run chi-square outlier test and treat each attribute separately for outlier capping.
#boxplot shows outliers for dependent variable
ggplot(data=trainset,aes(x=1,y=trainset$HC01_VC128))+geom_boxplot()+xlab('Boxplot')+ylab('')
#tranform and cap dependent variable
trainset$Target<-pmin(log(trainset$HC01_VC128),quantile(log(trainset$HC01_VC128),0.99))
trainset<-trainset%>%select(-HC01_VC128)
#cap numeric independent variable
for (i in 1:ncol(trainset)){
if ((substr(colnames(trainset)[i],1,5)=='Weat_'|substr(colnames(trainset)[i],1,5)=='HC01_')&(class(eval(parse(text=paste0('trainset$',colnames(trainset)[i]))))=='numeric'|class(eval(parse(text=paste0('trainset$',colnames(trainset)[i]))))=='integer')){
trainset[,paste0(colnames(trainset)[i],'_cap')]<-pmin(pmax(eval(parse(text=paste0('trainset$',colnames(trainset)[i]))),quantile(eval(parse(text=paste0('trainset$',colnames(trainset)[i]))),0.05,na.rm=TRUE)),quantile(eval(parse(text=paste0('trainset$',colnames(trainset)[i]))),0.95,na.rm=TRUE))
}
}
#save a capping value table for testset transformation later
trainset.num<-trainset[,sapply(trainset,class)=='numeric'|sapply(trainset,class)=='integer']
capping<-as.matrix(rbind(colnames(trainset.num),sapply(trainset.num,function(x) quantile(x,0.05,na.rm=TRUE)),sapply(trainset.num,function(x) quantile(x,0.95,na.rm=TRUE))))
colnames(capping)<-capping[1,]
capping<-as.data.frame(capping[2:3,])
Regarding to missing values, we treat variables from different datasets differently.
For weather related variables, climate is highy correlated with location, and areas within the same state share similar climate, therefore one way to impute the weather data is to use the state mean, and we have fully populated state information in the dataset, which makes this approach plausible.
For house related variables, we can try two different approches, one is to utilize bivariate plot to go through the relationships between each indenpent variable and independent variables, and impute the missing values with user defined statistical method (mean, min, max, zero, etc.), or we can use k nearest neighor to impute the missing values. Here we use k-nearest neighbor to impute our housing variables.
Select useful columns
trainset<-as.data.frame(trainset%>%select(Target,ends_with('_cap'),starts_with('flag_'),contains('_Type'),ELEVATION,STATE.x,ALAND,AWATER))
Seperate different data type columns
colclasses <- sapply(trainset,class)
numColumns <- which(colclasses=="numeric"|colclasses=="integer")
JustNumbers<-trainset%>%select(STATE.x,numColumns)
Create state average lookup table for weather related attributes imputation
weat_state<-list()
for (i in seq_along(1:(ncol(JustNumbers)))){
if (i==1){
weat_state[[i]]<-as.vector((JustNumbers%>%select(STATE.x,colnames(JustNumbers)[i])%>%group_by(STATE.x)%>%summarise(avg=mean(eval(parse(text=colnames(JustNumbers)[i])),na.rm=TRUE)))[,1])
} else{
weat_state[[i]]<-as.vector((JustNumbers%>%select(STATE.x,colnames(JustNumbers)[i])%>%group_by(STATE.x)%>%summarise(avg=mean(eval(parse(text=colnames(JustNumbers)[i])),na.rm=TRUE)))[,2])
}
}
weat_state_avg<-matrix(nrow=51)
for (i in 1:length(weat_state)){
weat_state_avg<-cbind(weat_state_avg,(unlist(weat_state[i])))
}
weat_state_avg<-as_tibble((weat_state_avg))[,2:ncol(weat_state_avg)]
weat_state_avg<-cbind(weat_state_avg[,1],sapply(weat_state_avg[,2:ncol(weat_state_avg)],as.numeric))
names(weat_state_avg)=c('STATE.x',paste0('state_avg_',colnames(JustNumbers)[2:ncol(JustNumbers)]))
trainset<-as.data.frame(trainset%>%left_join(weat_state_avg,by='STATE.x'))
Attributes imputation
#Impute house characteristics data first with k nearest neighbor
HC_01<-trainset%>%select(starts_with('HC01_'))
HC_01_Imputed<-knnImputation(HC_01,k=3)
names(HC_01_Imputed)<-paste0('Impute_',colnames(HC_01_Imputed))
#Imput weather factor data with mode and numeric data with state average
for (i in 1:ncol(trainset)){
if((class(trainset[,i])=='character'|class(trainset[,i])=='factor')&substr(colnames(trainset)[i],1,5)!='flag_'){ #Impute Weather type data and factors
trainset[,i][trainset[,i] == 'P'] <- NA
trainset[,paste0("F_",colnames(trainset)[i])]<-impute(as.factor(trainset[,i]),mode)
}
else if((substr(colnames(trainset)[i],1,5)=='Weat_')&(class(trainset[,i])=='numeric'|class(trainset[,i])=='integer')&(substr(colnames(trainset)[i],1,10)!='state_avg_')){ #Impute Weather numeric data
trainset[,i]<-as.numeric(trainset[,i])
trainset[,paste0("Impute_",colnames(trainset)[i])]=coalesce(trainset[,i],eval(parse(text=paste0('trainset$state_avg_',colnames(trainset)[i]))))#Impute state average
}
}
trainset<-cbind(trainset,HC_01_Imputed)
trainset<-as.data.frame(trainset%>%select(Target,starts_with('flag_'),starts_with('Impute_'),starts_with('F_'),ELEVATION, ALAND, AWATER))
Climate is highly impacted by the geograpic feature of the location. However, how granular our data needs to be regarding to geographic feature is debatable. In this project, we include both State and Climate regions information in our model.
trainset$Regions<-as.factor(ifelse(trainset$F_STATE.x%in%c('NT','WY','CO','ND','SD','NE','KS'),'northwest',ifelse(trainset$F_STATE.x%in%c('MN','IA','IL','WI','MO','OH','IN','MI'),'midwest',ifelse(trainset$F_STATE.x%in%c('AZ','NM','TX','OK'),'southwest',ifelse(trainset$F_STATE.x%in%c('WA','OR','NV','ID','UT','CA'),'westcoast',ifelse(trainset$F_STATE.x%in%c('AR','LA','MS','TN','AL','GA','FL','SC','NC','VA','WV','KY'),'south',('northeast')))))))
# trainset<-trainset%>%select(-F_STATE.x,)
After feature engineering, now we can look at some of the important weather attributes, and examine their relationships with home price.
#spring months
bivar.plot(trainset,'Impute_Weat_Apr_prcp_cap','Target',n.rank=50)
bivar.plot(trainset,'Impute_Weat_May_prcp_cap','Target',n.rank=50)
#summer months
bivar.plot(trainset,'Impute_Weat_Jun_prcp_cap','Target',n.rank=50)
#fall months
bivar.plot(trainset,'Impute_Weat_Sep_prcp_cap','Target',n.rank=50)
#winter months
bivar.plot(trainset,'Impute_Weat_Jan_prcp_cap','Target',n.rank=50)
bivar.plot(trainset,'Impute_Weat_Feb_prcp_cap','Target',n.rank=50)
From the above charts, we see that higher precipitation areas has lower home price in summer months, however higher home price in winter months. One interesting observation about this attribute is that in September, precipitation has a nonlinear relationship with home price.
To deal with this relationship, we can create two variables to represent different trends in different range. Here we use WOE to find the cutoff point.
WOE formula:
WOE=ln(\(\frac{p(non-event)}{p(event)}\))
IV=\(\sum_{i=1}^n (DistributionGood-DistributionBad)\)*WOE
WOE_numeric_split(x='Impute_Weat_Sep_prcp_cap',y1='Target',data=trainset,group=10)
## $WOE
## # A tibble: 7 x 14
## grp n_obs mean_x sum_y1 sum_y0 sum_w mean_y1 mean_y0 LowerBound
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 470 56.2 5737. 5541. 470 12.2 11.8 45
## 2 2 469 116. 5593. 5530. 469 11.9 11.8 89
## 3 3 473 162. 5578. 5577. 473 11.8 11.8 138
## 4 4 928 279. 10735. 10941. 928 11.6 11.8 199
## 5 5 955 348. 11150. 11260. 955 11.7 11.8 324
## 6 6 943 405. 11125. 11118. 943 11.8 11.8 375
## 7 7 472 497. 5613. 5565. 472 11.9 11.8 450
## # ... with 5 more variables: UpperBound <dbl>, woe <dbl>, ks <dbl>,
## # info <dbl>, plot.width <dbl>
##
## $Stat
## MAX_KS Info.Value Trend.Estimate Trend.Pr(>|t|)
## 4.698438e-01 2.337355e-04 -5.022986e-05 3.243015e-01
##
## $WOE_Code
## [1] "trainset$w_Impute_Weat_Sep_prcp_cap = trainset$Impute_Weat_Sep_prcp_cap"
## [2] "g_Impute_Weat_Sep_prcp_cap = c(45, 89, 138, 199, 324, 375, 450, 518)"
## [3] "trainset$w_Impute_Weat_Sep_prcp_cap[findInterval(trainset$Impute_Weat_Sep_prcp_cap, g_Impute_Weat_Sep_prcp_cap, rightmost.closed=TRUE) == 1] = 0.0347702723323038"
## [4] "trainset$w_Impute_Weat_Sep_prcp_cap[findInterval(trainset$Impute_Weat_Sep_prcp_cap, g_Impute_Weat_Sep_prcp_cap, rightmost.closed=TRUE) == 2] = 0.0113994413145294"
## [5] "trainset$w_Impute_Weat_Sep_prcp_cap[findInterval(trainset$Impute_Weat_Sep_prcp_cap, g_Impute_Weat_Sep_prcp_cap, rightmost.closed=TRUE) == 3] = 0.000260558517112001"
## [6] "trainset$w_Impute_Weat_Sep_prcp_cap[findInterval(trainset$Impute_Weat_Sep_prcp_cap, g_Impute_Weat_Sep_prcp_cap, rightmost.closed=TRUE) == 4] = -0.0190688743527721"
## [7] "trainset$w_Impute_Weat_Sep_prcp_cap[findInterval(trainset$Impute_Weat_Sep_prcp_cap, g_Impute_Weat_Sep_prcp_cap, rightmost.closed=TRUE) == 5] = -0.00979637863295558"
## [8] "trainset$w_Impute_Weat_Sep_prcp_cap[findInterval(trainset$Impute_Weat_Sep_prcp_cap, g_Impute_Weat_Sep_prcp_cap, rightmost.closed=TRUE) == 6] = 0.000641267961459832"
## [9] "trainset$w_Impute_Weat_Sep_prcp_cap[findInterval(trainset$Impute_Weat_Sep_prcp_cap, g_Impute_Weat_Sep_prcp_cap, rightmost.closed=TRUE) == 7] = 0.0086564356766699"
##
## $Plot
Seen from the above WOE graph, the trend changed from negative to positive at bin 4, and we can find the mean of the independent variable for bin 4 from the table, which is -0.08290098. So we use this value as our cutoff value.
trainset$Impute_Weat_Sep_prcp_g1_cap<-pmax(-0.08290098-trainset$Impute_Weat_Sep_prcp_cap,0)
trainset$Impute_Weat_Sep_prcp_g2_cap<-pmax(trainset$Impute_Weat_Sep_prcp_cap+0.08290098,0)
trainset<-trainset%>%select(-Impute_Weat_Sep_prcp_cap)
#spring months
bivar.plot(trainset,'Impute_Weat_Apr_tavg_cap','Target',n.rank=50)
#summer months
bivar.plot(trainset,'Impute_Weat_Jun_tavg_cap','Target',n.rank=50)
bivar.plot(trainset,'Impute_Weat_Jul_tavg_cap','Target',n.rank=50)
bivar.plot(trainset,'Impute_Weat_Aug_tavg_cap','Target',n.rank=50)
#winter months
bivar.plot(trainset,'Impute_Weat_Jan_tavg_cap','Target',n.rank=50)
From the above charts, we see that home price is higher when summer months tempurature is lower, winter months tempurature does not have as big of an impact to the home price as summer tempurater.
#spring months
bivar.plot(trainset,'Impute_Weat_Apr_tmax_cap','Target',n.rank=50)
bivar.plot(trainset,'Impute_Weat_May_tmax_cap','Target',n.rank=50)
#summer months
bivar.plot(trainset,'Impute_Weat_Jun_tmax_cap','Target',n.rank=50)
bivar.plot(trainset,'Impute_Weat_Jul_tmax_cap','Target',n.rank=50)
bivar.plot(trainset,'Impute_Weat_Aug_tmax_cap','Target',n.rank=50)
#winter months
bivar.plot(trainset,'Impute_Weat_Dec_tmax_cap','Target',n.rank=50)
bivar.plot(trainset,'Impute_Weat_Jan_tmax_cap','Target',n.rank=50)
#spring months
bivar.plot(trainset,'Impute_Weat_May_tmin_cap','Target',n.rank=50)
#summer months
bivar.plot(trainset,'Impute_Weat_Jun_tmin_cap','Target',n.rank=50)
bivar.plot(trainset,'Impute_Weat_Jul_tmin_cap','Target',n.rank=50)
bivar.plot(trainset,'Impute_Weat_Aug_tmin_cap','Target',n.rank=50)
#winter months
bivar.plot(trainset,'Impute_Weat_Dec_tmin_cap','Target',n.rank=50)
bivar.plot(trainset,'Impute_Weat_Jan_tmin_cap','Target',n.rank=50)
trainset%>%group_by(Regions)%>%summarise(avg_home_price=mean((Target)))
## # A tibble: 6 x 2
## Regions avg_home_price
## <fct> <dbl>
## 1 midwest 11.7
## 2 northeast 12.2
## 3 northwest 11.6
## 4 south 11.6
## 5 southwest 11.5
## 6 westcoast 12.2
Based on the exploratory analysis, we can see that geographic location, house characteristics and weather data are all predictable. The home price varies depending on which state the house is located, the house characteristics, and the location tempurature and the percentage of snowfall.
As for each record, the recording type is the same for weather attributes, so here we only look at one recording type attribute as an example.
trainset%>%group_by(F_Weat_Jan_Type_prcp)%>%summarise(avg_home_price=mean((Target)))
## # A tibble: 4 x 2
## F_Weat_Jan_Type_prcp avg_home_price
## <fct> <dbl>
## 1 C 11.8
## 2 Q 11.9
## 3 R 11.8
## 4 S 11.7
trainset%>%group_by(F_Weat_Jan_Type_tavg)%>%summarise(avg_home_price=mean((Target)))
## # A tibble: 4 x 2
## F_Weat_Jan_Type_tavg avg_home_price
## <fct> <dbl>
## 1 C 11.8
## 2 Q 11.8
## 3 R 11.8
## 4 S 11.8
trainset%>%group_by(F_Weat_Jan_Type_tmax)%>%summarise(avg_home_price=mean((Target)))
## # A tibble: 4 x 2
## F_Weat_Jan_Type_tmax avg_home_price
## <fct> <dbl>
## 1 C 11.8
## 2 Q 11.8
## 3 R 11.8
## 4 S 11.8
trainset%>%group_by(F_Weat_Jan_Type_tmin)%>%summarise(avg_home_price=mean((Target)))
## # A tibble: 4 x 2
## F_Weat_Jan_Type_tmin avg_home_price
## <fct> <dbl>
## 1 C 11.9
## 2 Q 11.8
## 3 R 11.8
## 4 S 11.8
Standardize numeric attributes
As our weather and house data have very different scales, to better compare the coefficients of our models, we standardize the attributes before running models.
colclasses <- sapply(trainset,class)
numColumns <- which(colclasses=="numeric"|colclasses=="integer")
JustNumbers<-trainset%>%select(numColumns)
preObj <- preProcess(JustNumbers[, -1], method=c("center", "scale"))
scaled.JustNumbers <- predict(preObj, JustNumbers[, -1])
othrColumns<-trainset%>%select(-numColumns)
trainset<-cbind(trainset$Target,othrColumns,scaled.JustNumbers)
names(trainset)[1]<-'Target'
Before variable reduction, we create dummy variables for all the factor columns first.
trainset<-Filter(function(x)(length(unique(x))>1), trainset)
x <- model.matrix(Target~., trainset)[,-1]
y<-trainset$Target
z<-as.data.frame(x)
colnames(z)<-make.names(names(as.data.frame(z)), unique = FALSE, allow_ = TRUE)
trainset<-cbind(y,as.data.frame(z))
names(trainset)[1]<-'Target'
Lasso regression
trainset<-Filter(function(x)(length(unique(x))>1), trainset)
set.seed(123)
cv.lasso <- cv.glmnet(x, (y), type.measure='mse',nfolds=5, alpha = 1, family = "gaussian")
cv.lasso$lambda.min
## [1] 0.0006357814
model <- glmnet(x, y, alpha = 1, family = "gaussian",
lambda = cv.lasso$lambda.min)
print(model)
##
## Call: glmnet(x = x, y = y, family = "gaussian", alpha = 1, lambda = cv.lasso$lambda.min)
##
## Df %Dev Lambda
## [1,] 205 0.8522 0.0006358
rank<-caret::getModelInfo("glmnet")$glmnet$varImp(model,lambda=cv.lasso$lambda.min)
#check variables with 0 coefficients
c<-coef(cv.lasso,s='lambda.min',exact=TRUE)
inds<-which(c==0)
variables<-row.names(c)[sort(inds,decreasing = FALSE)]
print(variables[1:50])
## [1] "flag_Weat_Jan_Type_prcp1" "flag_Weat_Feb_prcp1"
## [3] "flag_Weat_Feb_Type_prcp1" "flag_Weat_Mar_prcp1"
## [5] "flag_Weat_Mar_Type_prcp1" "flag_Weat_Apr_prcp1"
## [7] "flag_Weat_Apr_Type_prcp1" "flag_Weat_May_prcp1"
## [9] "flag_Weat_May_Type_prcp1" "flag_Weat_Jun_prcp1"
## [11] "flag_Weat_Jun_Type_prcp1" "flag_Weat_Jul_prcp1"
## [13] "flag_Weat_Jul_Type_prcp1" "flag_Weat_Aug_prcp1"
## [15] "flag_Weat_Aug_Type_prcp1" "flag_Weat_Sep_prcp1"
## [17] "flag_Weat_Sep_Type_prcp1" "flag_Weat_Oct_prcp1"
## [19] "flag_Weat_Oct_Type_prcp1" "flag_Weat_Nov_prcp1"
## [21] "flag_Weat_Nov_Type_prcp1" "flag_Weat_Dec_prcp1"
## [23] "flag_Weat_Dec_Type_prcp1" "flag_Weat_Jan_Type_tavg1"
## [25] "flag_Weat_Feb_tavg1" "flag_Weat_Feb_Type_tavg1"
## [27] "flag_Weat_Mar_tavg1" "flag_Weat_Mar_Type_tavg1"
## [29] "flag_Weat_Apr_tavg1" "flag_Weat_Apr_Type_tavg1"
## [31] "flag_Weat_May_tavg1" "flag_Weat_May_Type_tavg1"
## [33] "flag_Weat_Jun_tavg1" "flag_Weat_Jun_Type_tavg1"
## [35] "flag_Weat_Jul_tavg1" "flag_Weat_Jul_Type_tavg1"
## [37] "flag_Weat_Aug_tavg1" "flag_Weat_Aug_Type_tavg1"
## [39] "flag_Weat_Sep_tavg1" "flag_Weat_Sep_Type_tavg1"
## [41] "flag_Weat_Oct_tavg1" "flag_Weat_Oct_Type_tavg1"
## [43] "flag_Weat_Nov_tavg1" "flag_Weat_Nov_Type_tavg1"
## [45] "flag_Weat_Dec_tavg1" "flag_Weat_Dec_Type_tavg1"
## [47] "flag_Weat_Jan_tmax1" "flag_Weat_Jan_Type_tmax1"
## [49] "flag_Weat_Feb_tmax1" "flag_Weat_Feb_Type_tmax1"
From Lasso regression, we see that most of the variables with 0 coefficients are flag variables. This gives us a general idea on what variables to drop later.
trainset=trainset[!is.na(trainset$Target),]
#drop no variation columns
trainset<-Filter(function(x)(length(unique(x))>1), trainset)
#model_1
base.model<-glm(Target~., data=trainset, family = 'gaussian')
#Get variable list with P-value less than 0.05
varslist<-summary(base.model)$coefficients[(summary(base.model)$coef[, "Pr(>|t|)"])<0.05&!is.na((summary(base.model)$coef[, "Pr(>|t|)"])<0.05),1]
#model_2
trainset2<-trainset[,c(names(varslist)[2:length(names(varslist))])]
trainset2<-cbind(trainset$Target,trainset2)
names(trainset2)[1]<-'Target'
base.model2<-glm(Target~., data=trainset2, family = 'gaussian')
#Get variable list with P-value less than 0.05
varslist2<-summary(base.model2)$coefficients[(summary(base.model2)$coef[, "Pr(>|t|)"])<0.05,1]
names(varslist2)
vif(lm(base.model2, data=trainset2))
#model_3
trainset3<-trainset2[,c(names(varslist2)[2:length(names(varslist2))])]
trainset3<-cbind(trainset$Target,trainset3)
names(trainset3)[1]<-'Target'
base.model3<-glm(Target~., data=trainset3, family = 'gaussian')
vif(lm(base.model3, data=trainset3))
#model_4
#Drop variables with high vif values in base.model3
trainset4<-trainset3%>%select(-Impute_HC01_VC03_cap,-Impute_Weat_Oct_tmax_cap,-Impute_HC01_VC141_cap,-flag_HC01_VC691)
base.model4<-glm(Target~., data=trainset4, family = 'gaussian')
summary(base.model4)
##
## Call:
## glm(formula = Target ~ ., family = "gaussian", data = trainset4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.28188 -0.10381 0.01252 0.11990 1.72925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.794244 0.004107 2871.916 < 2e-16 ***
## flag_HC01_VC091 0.249563 0.056295 4.433 9.50e-06 ***
## flag_HC01_VC701 0.179391 0.039556 4.535 5.90e-06 ***
## flag_HC01_VC1461 -0.112936 0.027384 -4.124 3.79e-05 ***
## flag_HC01_VC1551 0.201619 0.031707 6.359 2.23e-10 ***
## F_STATE.xCO 0.096090 0.022293 4.310 1.66e-05 ***
## F_STATE.xHI 0.247879 0.031399 7.894 3.61e-15 ***
## F_STATE.xKS -0.161327 0.018551 -8.697 < 2e-16 ***
## F_STATE.xMT 0.089192 0.021167 4.214 2.56e-05 ***
## F_STATE.xSD -0.194366 0.024675 -7.877 4.14e-15 ***
## F_STATE.xTX -0.201945 0.016089 -12.552 < 2e-16 ***
## F_STATE.xUT 0.122146 0.025860 4.723 2.39e-06 ***
## F_STATE.xVA 0.129047 0.027445 4.702 2.65e-06 ***
## F_STATE.xWY 0.144247 0.028322 5.093 3.66e-07 ***
## Impute_Weat_Jul_prcp_cap -0.028106 0.004596 -6.115 1.04e-09 ***
## Impute_Weat_Jan_tavg_cap 0.106466 0.009468 11.245 < 2e-16 ***
## Impute_Weat_Sep_tavg_cap -0.108954 0.009212 -11.827 < 2e-16 ***
## Impute_HC01_VC05_cap 0.012284 0.006380 1.925 0.054230 .
## Impute_HC01_VC08_cap 0.010799 0.003707 2.913 0.003598 **
## Impute_HC01_VC21_cap 0.031565 0.006703 4.709 2.57e-06 ***
## Impute_HC01_VC42_cap 0.025022 0.006570 3.809 0.000141 ***
## Impute_HC01_VC50_cap 0.014090 0.004979 2.830 0.004673 **
## Impute_HC01_VC70_cap -0.020904 0.003996 -5.231 1.76e-07 ***
## Impute_HC01_VC85_cap -0.039027 0.007486 -5.213 1.94e-07 ***
## Impute_HC01_VC94_cap 0.021537 0.004491 4.796 1.67e-06 ***
## Impute_HC01_VC120_cap -0.150008 0.009165 -16.367 < 2e-16 ***
## Impute_HC01_VC121_cap -0.068646 0.011326 -6.061 1.46e-09 ***
## Impute_HC01_VC122_cap 0.014258 0.012468 1.144 0.252855
## Impute_HC01_VC123_cap 0.034687 0.012259 2.829 0.004683 **
## Impute_HC01_VC125_cap 0.039963 0.012364 3.232 0.001237 **
## Impute_HC01_VC126_cap 0.050687 0.011607 4.367 1.29e-05 ***
## Impute_HC01_VC127_cap 0.023375 0.006624 3.529 0.000421 ***
## Impute_HC01_VC140_cap 0.124565 0.014076 8.849 < 2e-16 ***
## Impute_HC01_VC142_cap -0.090684 0.013074 -6.936 4.58e-12 ***
## Impute_HC01_VC143_cap -0.050596 0.014079 -3.594 0.000329 ***
## Impute_HC01_VC144_cap -0.024033 0.012891 -1.864 0.062352 .
## Impute_HC01_VC145_cap -0.040346 0.011992 -3.364 0.000773 ***
## Impute_HC01_VC146_cap 0.328105 0.008051 40.753 < 2e-16 ***
## Impute_HC01_VC155_cap 0.062627 0.006626 9.451 < 2e-16 ***
## Impute_HC01_VC170_cap 0.095805 0.011102 8.630 < 2e-16 ***
## Impute_HC01_VC191_cap 0.075179 0.006644 11.315 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.05424897)
##
## Null deviance: 1477.93 on 4709 degrees of freedom
## Residual deviance: 253.29 on 4669 degrees of freedom
## AIC: -316.53
##
## Number of Fisher Scoring iterations: 2
vif(lm(base.model4, data=trainset4))
## flag_HC01_VC091 flag_HC01_VC701 flag_HC01_VC1461
## 1.856701 1.988951 2.097918
## flag_HC01_VC1551 F_STATE.xCO F_STATE.xHI
## 2.043867 1.106112 1.306127
## F_STATE.xKS F_STATE.xMT F_STATE.xSD
## 1.109903 1.160733 1.077041
## F_STATE.xTX F_STATE.xUT F_STATE.xVA
## 1.566230 1.171144 1.051655
## F_STATE.xWY Impute_Weat_Jul_prcp_cap Impute_Weat_Jan_tavg_cap
## 1.091277 1.833726 7.780751
## Impute_Weat_Sep_tavg_cap Impute_HC01_VC05_cap Impute_HC01_VC08_cap
## 7.366807 3.532799 1.193014
## Impute_HC01_VC21_cap Impute_HC01_VC42_cap Impute_HC01_VC50_cap
## 3.900621 3.746380 2.151665
## Impute_HC01_VC70_cap Impute_HC01_VC85_cap Impute_HC01_VC94_cap
## 1.385987 4.865102 1.750420
## Impute_HC01_VC120_cap Impute_HC01_VC121_cap Impute_HC01_VC122_cap
## 7.291768 11.135451 13.493400
## Impute_HC01_VC123_cap Impute_HC01_VC125_cap Impute_HC01_VC126_cap
## 13.045743 13.270368 11.694692
## Impute_HC01_VC127_cap Impute_HC01_VC140_cap Impute_HC01_VC142_cap
## 3.808437 17.199512 14.837614
## Impute_HC01_VC143_cap Impute_HC01_VC144_cap Impute_HC01_VC145_cap
## 17.206585 14.425669 12.483018
## Impute_HC01_VC146_cap Impute_HC01_VC155_cap Impute_HC01_VC170_cap
## 5.626486 3.811400 10.698985
## Impute_HC01_VC191_cap
## 3.831942
We do not see extremly high vif values. Therefore, our final linear regression model does not have multi-collinearity issue.
*For any transformation explanations, refer to previous trainset feature engineering section.
for (i in 1:ncol(testset)){
if(colnames(testset)[i]!='HC01_VC128' & colnames(testset)[i]!='Target'){
testset[,paste0("flag_",colnames(testset)[i])]<-as.factor((ifelse(is.na(testset[,i]),1,0)))
}
}
testset<-as.data.frame(testset%>%select(-HC01_VC128))
#capping variables using trainset capping values
for (i in 1:ncol(testset)){
if ((substr(colnames(testset)[i],1,5)=='Weat_'|substr(colnames(testset)[i],1,5)=='HC01_')&(class(eval(parse(text=paste0('testset$',colnames(testset)[i]))))=='numeric'|class(eval(parse(text=paste0('testset$',colnames(testset)[i]))))=='integer')){
testset[,paste0(colnames(testset)[i],'_cap')]<-pmin(pmax(testset[,i],capping[1,colnames(testset)[i]]),capping[2,colnames(testset)[i]])
}
}
testset<-as.data.frame(testset%>%select(Target,ends_with('_cap'),starts_with('flag_'),contains('_Type'),STATE.x,ELEVATION))
testset<-as.data.frame(testset%>%left_join(weat_state_avg,by='STATE.x'))
#Impute house characteristics data first with k nearest neighbor
HC_01<-testset%>%select(starts_with('HC01_'))
HC_01_Imputed<-knnImputation(HC_01,k=3)
names(HC_01_Imputed)<-paste0('Impute_',colnames(HC_01_Imputed))
#Imput weather data with mode and state average
for (i in 1:ncol(testset)){
if((class(testset[,i])=='character'|class(testset[,i])=='factor')&substr(colnames(testset)[i],1,5)!='flag_'){ #Impute Weather type data and factors
testset[,paste0("F_",colnames(testset)[i])]<-impute(as.factor(testset[,i]),mode)
}
else if((substr(colnames(testset)[i],1,5)=='Weat_')&(class(testset[,i])=='numeric'|class(testset[,i])=='integer')&(substr(colnames(testset)[i],1,10)!='state_avg_')){ #Impute Weather numeric data
testset[,i]<-as.numeric(testset[,i])
testset[,paste0("Impute_",colnames(testset)[i])]=coalesce(testset[,i],eval(parse(text=paste0('testset$state_avg_',colnames(testset)[i]))))#Impute state average
}
}
testset<-cbind(testset,HC_01_Imputed)
testset<-as.data.frame(testset%>%select(Target,starts_with('flag_'),starts_with('Impute_'),starts_with('F_'),ELEVATION))
testset$Regions<-as.factor(ifelse(testset$F_STATE.x%in%c('NT','WY','CO','ND','SD','NE','KS'),'northwest',ifelse(testset$F_STATE.x%in%c('MN','IA','IL','WI','MO','OH','IN','MI'),'midwest',ifelse(testset$F_STATE.x%in%c('AZ','NM','TX','OK'),'southwest',ifelse(testset$F_STATE.x%in%c('WA','OR','NV','ID','UT','CA'),'westcoast',ifelse(testset$F_STATE.x%in%c('AR','LA','MS','TN','AL','GA','FL','SC','NC','VA','WV','KY'),'south',('northeast')))))))
# testset<-testset%>%select(-F_STATE.x)
testset$Impute_Weat_Sep_prcp_g1_cap<-pmax(-0.08290098-testset$Impute_Weat_Sep_prcp_cap,0)
testset$Impute_Weat_Sep_prcp_g2_cap<-pmax(testset$Impute_Weat_Sep_prcp_cap+0.08290098,0)
testset<-testset%>%select(-Impute_Weat_Sep_prcp_cap)
colclasses <- sapply(testset,class)
numColumns <- which(colclasses=="numeric"|colclasses=="integer")
JustNumbers<-testset%>%select(numColumns)
preObj <- preProcess(JustNumbers[, -1], method=c("center", "scale"))
scaled.JustNumbers <- predict(preObj, JustNumbers[, -1])
OthrNumbers<-testset%>%select(-numColumns)
set<-cbind(testset$Target,OthrNumbers,scaled.JustNumbers)
names(set)[1]<-'Target'
testset<-Filter(function(x)(length(unique(x))>1), testset)
# testset$Target<-log(testset$Target)
x <- model.matrix(Target~., testset)[,-1]
y<-testset$Target
x<-as.data.frame(x)
colnames(x)<-make.names(names(as.data.frame(x)), unique = FALSE, allow_ = TRUE)
testset<-cbind(y,as.data.frame(x))
names(testset)[1]<-'Target'
testset$glm_pred<-round(predict(base.model4,newdata=testset))
As we standardized our independent variables to mean=0, we need to rescale our prediction back to the actual dependent variable mean level for the next step.
testset$glm_pred_rescaled<-(testset$glm_pred)*(mean(exp(testset$Target))/mean(testset$glm_pred))
*GAINS.CHART function is a self defined function. The function uses predicted dependent variables quantiles as x-axis, and plot the actual dependent variable on y-axis. A good gains chart should show a nice upward trend with good KS value, and good percentage gains between the highest group and the lowest group.
GAINS.CHART((exp(testset$Target)),(testset$glm_pred_rescaled),n.rank=20)
## $gainschart
## Rank Actual_Sum Actual_Mean Pred_Sum Pred_Mean Count
## 1 1 7845200 65926.05 9025484 75844.40 119
## 2 2 8409900 71879.49 11009493 94098.23 117
## 3 3 9184900 77838.14 11952935 101296.06 118
## 4 4 10491600 85996.72 13119060 107533.28 122
## 5 5 10599900 93804.42 12762194 112939.77 113
## 6 6 11522100 95223.97 14223204 117547.14 121
## 7 7 12095600 105179.13 14134706 122910.49 115
## 8 8 13959100 115364.46 15574630 128715.95 121
## 9 9 13865000 118504.27 15754182 134651.13 117
## 10 10 15564000 126536.59 17238195 140147.93 123
## 11 11 14659800 133270.91 16065361 146048.74 110
## 12 12 16702800 141549.15 18075568 153182.78 118
## 13 13 18288600 153685.71 19249041 161756.64 119
## 14 14 19780200 169061.54 19985776 170818.60 117
## 15 15 20999100 181026.72 20935289 180476.63 116
## 16 16 22843200 193586.44 22540707 191022.94 118
## 17 17 25399000 209909.09 24807141 205017.70 121
## 18 18 26717600 234364.91 25620873 224744.50 114
## 19 19 33790400 283952.94 30497190 256278.91 119
## 20 20 60595500 517910.26 40742472 348226.25 117
##
## $ks
## [1] 21.29
##
## $gini
## [1] 58.76
Lorenz Curve is commonly used to check model overfitting
#predict and rescale on trainset
trainset4$glm_pred<-round(predict(base.model4,newdata=trainset4))
trainset4$glm_pred_rescaled<-(trainset4$glm_pred)*(mean(exp(trainset4$Target))/mean(trainset4$glm_pred))
##Lorenz Curve
trainCheck<- OrderedCDF(Score = trainset4$glm_pred, Loss = exp(trainset4$Target), Weight =rep(1,nrow(trainset4)), NBins = 1000)
ValidationCheck<- OrderedCDF(Score = testset$glm_pred, Loss = exp(testset$Target), Weight =rep(1,nrow(testset)), NBins = 1000)
ValidationWeight <- ValidationCheck$WeightPts
ValidationLoss <- ValidationCheck$LossPts
TrainWeight <- trainCheck$WeightPts
TrainLoss <- trainCheck$LossPts
ModelComparisonData <- data.frame(ValidationWeight, ValidationLoss, TrainWeight, TrainLoss)
ModelComparisonData %>% ggvis(~ValidationWeight, ~ValidationLoss) %>% layer_paths(stroke := "blue") %>%
layer_paths(data = ModelComparisonData, x = ~ValidationWeight, y = ~ValidationWeight, stroke := "black") %>%
layer_paths(data = ModelComparisonData, x = ~TrainWeight, y = ~TrainLoss, stroke := "red")
The above chart shows lorenz curve for Trainset (redline) and Testset (blueline). Further away the curves are from the diagonal line means better prediction power. From the chart, our testset performance was even better than trainset, therefore, no overfitting problem presents in the linear regression model.
#Train set
#Create missing flags
for (i in 1:ncol(gbm.trainset)){
if(colnames(gbm.trainset)[i]!='HC01_VC128' & colnames(gbm.trainset)[i]!='Target'){
gbm.trainset[,paste0("flag_",colnames(gbm.trainset)[i])]<-as.factor((ifelse(is.na(gbm.trainset[,i]),1,0)))
}
}
#boxplot shows outliers for dependent variable
ggplot(data=gbm.trainset,aes(x=1,y=gbm.trainset$HC01_VC128))+geom_boxplot()+xlab('Boxplot')+ylab('')
#tranform and cap dependent variable
gbm.trainset$Target<-pmin(gbm.trainset$HC01_VC128,quantile(gbm.trainset$HC01_VC128,0.99))
gbm.trainset$Target<-gbm.trainset$HC01_VC128
gbm.trainset<-as.data.frame(gbm.trainset%>%select(-HC01_VC128))
#Remove independent variable outliers
for (i in 1:ncol(gbm.trainset)){
if ((substr(colnames(gbm.trainset)[i],1,5)=='Weat_'|substr(colnames(gbm.trainset)[i],1,5)=='HC01_')&(class(eval(parse(text=paste0('gbm.trainset$',colnames(gbm.trainset)[i]))))=='numeric'|class(eval(parse(text=paste0('gbm.trainset$',colnames(gbm.trainset)[i]))))=='integer')){
gbm.trainset[,i]<-pmin(pmax(eval(parse(text=paste0('gbm.trainset$',colnames(gbm.trainset)[i]))),quantile(eval(parse(text=paste0('gbm.trainset$',colnames(gbm.trainset)[i]))),0.05,na.rm=TRUE)),quantile(eval(parse(text=paste0('gbm.trainset$',colnames(gbm.trainset)[i]))),0.95,na.rm=TRUE))
}
}
#save a capping value table for gbm.testset transformation later
gbm.trainset.num<-gbm.trainset[,sapply(gbm.trainset,class)=='numeric'|sapply(gbm.trainset,class)=='integer']
capping.gbm<-as.matrix(rbind(colnames(gbm.trainset.num),sapply(gbm.trainset.num,function(x) quantile(x,0.05,na.rm=TRUE)),sapply(gbm.trainset.num,function(x) quantile(x,0.95,na.rm=TRUE))))
colnames(capping.gbm)<-capping.gbm[1,]
capping.gbm<-as.data.frame(capping.gbm[2:3,])
#Convert character variables to factors
for (i in 1:ncol(gbm.trainset)){
if(class(gbm.trainset[,i])=="character"){
gbm.trainset[,i]<-as.factor(gbm.trainset[,i])
} else {
gbm.trainset[,i]=gbm.trainset[,i]
}
}
#select useful columns
gbm.trainset<-as.data.frame(gbm.trainset%>%select(Target,starts_with('Weat_'),starts_with('flag_'),starts_with('HC01_'),STATE.x,ELEVATION))
#Create missing flags
for (i in 1:ncol(gbm.testset)){
if(colnames(gbm.testset)[i]!='HC01_VC128' & colnames(gbm.testset)[i]!='Target'){
gbm.testset[,paste0("flag_",colnames(gbm.testset)[i])]<-as.factor((ifelse(is.na(gbm.testset[,i]),1,0)))
}
}
#Create Target variable
gbm.testset$Target<-gbm.testset$HC01_VC128
gbm.testset<-as.data.frame(gbm.testset%>%select(-HC01_VC128))
#capping variables using gbm.trainset capping values
for (i in 1:ncol(gbm.testset)){
if ((substr(colnames(gbm.testset)[i],1,5)=='Weat_'|substr(colnames(gbm.testset)[i],1,5)=='HC01_')&(class(eval(parse(text=paste0('gbm.testset$',colnames(gbm.testset)[i]))))=='numeric'|class(eval(parse(text=paste0('gbm.testset$',colnames(gbm.testset)[i]))))=='integer')){
gbm.testset[,paste0(colnames(gbm.testset)[i])]<-pmin(pmax(gbm.testset[,i],capping.gbm[1,colnames(gbm.testset)[i]]),capping.gbm[2,colnames(gbm.testset)[i]])
}
}
#Convert character variables to factors
for (i in 1:ncol(gbm.testset)){
if((class(gbm.testset[,i])=='character')){
gbm.testset[,i]<-as.factor(gbm.testset[,i])
}
}
#select useful columns
gbm.testset<-as.data.frame(gbm.testset%>%select(Target,starts_with('Weat_'),starts_with('flag_'),starts_with('HC01_'),STATE.x,ELEVATION))
As parameter tuning is time-consuming, R markdown file documented the codes as below, however, cross-validation parameter tuning was run separately from this file with the following codes.
# ctrl <- trainControl(method = "repeatedcv",
# number = 5,
# # repeats = 5,
# summaryFunction=defaultSummary)
#
#
# set.seed(5627)
#
# gbmGrid <- expand.grid(interaction.depth = c(2,3),
# n.trees = (1:3)*500,
# shrinkage = c(0.1),
# n.minobsinnode = 100)
#
# gbm_fit <- train(Target~.,
# data = trainset,
# method = "gbm",
# bag.fraction = 0.5,
# verbose = FALSE,
# metric = "RMSE",
# trControl = ctrl,
# tuneGrid =gbmGrid
# )
#
# plot(gbm_fit)
# summary(gbm_fit)
# gbm_fit$bestTune
#run gbm with tuned parameters##############
set.seed(12345)
gbm_tuned <- gbm(Target~.,
distribution = "gaussian",
data = gbm.trainset,
n.trees = 1500,# convergence issue, just need enough to converge
shrinkage = 0.1, # convergence issue, lower is better but takes longer.
interaction.depth = 3, # effects tree complexity
n.minobsinnode = 200, # tree argument 1082/bag.fraction/train.fraction
bag.fraction = 0.5, #optimization argument
train.fraction = 1.0,
cv.folds=5, #gives condition on how good a model is doing
keep.data = TRUE,
verbose = TRUE, #"CV",
class.stratify.cv=NULL
)
best.iter <- gbm.perf(gbm_tuned, method="cv")
#Predict on trainset and testset
gbm.trainset$gbm_pred<- predict.gbm(gbm_tuned, n.trees = best.iter,newdata = gbm.trainset)
gbm.testset$gbm_pred <- predict.gbm(gbm_tuned, n.trees = best.iter,newdata = gbm.testset)
#Gains Chart on testset
GAINS.CHART(gbm.testset$Target,gbm.testset$gbm_pred,n.rank=20)
## $gainschart
## Rank Actual_Sum Actual_Mean Pred_Sum Pred_Mean Count
## 1 1 7132200 60442.37 6211418 52639.14 118
## 2 2 8247200 69891.53 8043978 68169.30 118
## 3 3 8929800 75676.27 8930668 75683.62 118
## 4 4 9549900 81623.08 9617977 82204.93 117
## 5 5 10175500 86233.05 10377220 87942.55 118
## 6 6 11151800 94506.78 10971049 92974.99 118
## 7 7 11485900 98170.09 11543818 98665.11 117
## 8 8 12691500 107555.08 12440965 105431.91 118
## 9 9 13120100 111187.29 13256961 112347.13 118
## 10 10 14265800 120896.61 14213748 120455.49 118
## 11 11 14716500 125782.05 15053535 128662.69 117
## 12 12 16583000 140533.90 16204647 137327.52 118
## 13 13 17664000 149694.92 17454559 147919.99 118
## 14 14 18936600 161851.28 18823928 160888.27 117
## 15 15 20793000 176211.86 20667931 175151.95 118
## 16 16 22572300 191290.68 22792007 193152.61 118
## 17 17 25909800 221451.28 25666247 219369.63 117
## 18 18 29003300 245790.68 29702521 251716.28 118
## 19 19 38269500 324317.80 37794034 320288.42 118
## 20 20 62115800 526405.08 57588786 488040.56 118
##
## $ks
## [1] 23.29
##
## $gini
## [1] 63.75
##Lorenz Curve
trainCheck<- OrderedCDF(Score = gbm.trainset$gbm_pred, Loss = (gbm.trainset$Target), Weight =rep(1,nrow(gbm.trainset)), NBins = 1000)
ValidationCheck<- OrderedCDF(Score = gbm.testset$gbm_pred, Loss = (gbm.testset$Target), Weight =rep(1,nrow(gbm.testset)), NBins = 1000)
ValidationWeight <- ValidationCheck$WeightPts
ValidationLoss <- ValidationCheck$LossPts
TrainWeight <- trainCheck$WeightPts
TrainLoss <- trainCheck$LossPts
ModelComparisonData <- data.frame(ValidationWeight, ValidationLoss, TrainWeight, TrainLoss)
ModelComparisonData %>% ggvis(~ValidationWeight, ~ValidationLoss) %>% layer_paths(stroke := "blue") %>%
layer_paths(data = ModelComparisonData, x = ~ValidationWeight, y = ~ValidationWeight, stroke := "black") %>%
layer_paths(data = ModelComparisonData, x = ~TrainWeight, y = ~TrainLoss, stroke := "red")
Based on the analysis, we see a significant relationship between weather related attributes and home pices. We can make some preliminary conclusions based on the analysis:
Home price is higher when precipitation is lower, which is intuitive, as high summer precipitation may cause flood. People prefer cooler weather in summer, which drives housing demand, and increase the home price, which is consistent with our results that home price is higher when summer tempurature is lower. Based on the analysis, winter tempurature does not have as big of an impact to the home price as summer tempurater.
Geographic location has a big impace on home price, which is also intuitive. This project used ‘State’ to model the impact of geographic location on home price. For future improvement, regional factors can be considered as the U.S. is typically grouped into climate regions based on similar weather patterns.
By comparing different models, gradient boosting model is our chosen model for this project, as it performed better than liner regression with better gains chart, and higher KS score. Gradient boosing was able to catch the non-linear relationship between our independent variables and the dependent variable, such as precipitation. Also gradient boosting does not require as much feature engineering as linear regression, which is ideal for a time-constrained project. Binning and smoothing can help to address some non-linear relationships in linear regression. However, the project only looked into weather related variables for non-linear relationship transformation, which can be improved in future study.
Segmentation based on geographic regions
Linear regression binning and smoothing
Missing value imputation with multiple imputation
Test the results on Holdout dataset.