HOUSING PRICE PREDICTION (AMES, IOWA)

INTRODUCTION AND PROBLEM STATEMENT
We are provided with data on residential home sales in Ames, Iowa between 2006 and 2010. The data set contains many explanatory variables describing the quality and quantity of physical attributes of the homes sold during these five years. Most of the variables capture information a typical home buyer would want to know about a property: square footage, number of bedrooms and bathrooms, lot size, and so on.

The aim of this task is to predict the sale price of housing in Ames, Iowa as accurately as possible by building advanced regression models.

PURPOSE
The main goal of this task is to apply the learnings of the course ADEC730201 and, where possible, go a step further in learning the tricks of the trade of data analysis and data science, including feature engineering and advanced forecasting techniques.

# Defining some housekeeping options here
options(warn = -1)
options(digits=4)

# Starting to read the data
train<-read.csv("D:/Boston College/MS AE Courses/Data Analysis/train.csv",header=TRUE)
test <-read.csv("D:/Boston College/MS AE Courses/Data Analysis/test.csv",header=TRUE)
  
# How the target variable looks
hist(train$SalePrice/1000)

hist(log(train$SalePrice))

The original sale price, as we can see, is heavily right-skewed. A log transformation brings it close to a normal distribution.
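A quick numeric check of the skew (a minimal sketch; it assumes the e1071 package is installed) confirms what the histograms show:

# Skewness of SalePrice before and after the log transform (assumes e1071 is installed)
library(e1071)
skewness(train$SalePrice)       # strongly positive: heavy right skew
skewness(log(train$SalePrice))  # much closer to 0 after the log transform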

DATA
The current analysis is done on the dataset available on Kaggle, which has already been divided into train and test sets of 1460 and 1459 observations respectively, each with 79 explanatory variables (in addition to Id, and to SalePrice in train). Of these 79 variables, 46 are categorical and 33 are continuous.
The original source of the data, however, is this, and it has many more records and features.

A crisp description of all the variables is available here. The response variable for our problem is SalePrice.
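A quick way to verify these counts directly from the data (a small sketch using base R only; the expected dimensions are taken from the description above):

# Check the data dimensions and column types as read by read.csv
dim(train)   # expect 1460 rows; 79 predictors plus the Id and SalePrice columns
table(sapply(train[, setdiff(names(train), c("Id", "SalePrice"))], class))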

LITERATURE REVIEW
There are many scholarly papers that discuss advanced regression techniques as strong quantitative tools for prediction. The ones I used most in this project are listed below.

The process and utility of classification and regression tree methodology in nursing research. J Adv Nurs. 2014 Jun; 70(6): 1276–1286. This paper discusses how to use classification trees in a similar (though not identical) situation, and it both explains and critiques the classification tree technique.

Jeong JH, Resop JP, Mueller ND, Fleisher DH, Yun K, Butler EE, et al. (2016). Random Forests for Global and Regional Crop Yield Predictions. PLoS ONE 11(6): e0156571. https://doi.org/10.1371/journal.pone.0156571. This is a very well written paper on a random forest implementation. The study evaluated the efficacy of RF regression, using MLR as a benchmark, to model complex yield responses of wheat, grain maize, potato, and silage maize at global and regional scales. The RF algorithm has many advantages for regressing complex crop systems but is not yet widely used in that field.

Singh, B. & Vyas, O.P. (2014). A Meta-Heuristic Regression-Based Feature Selection for Predictive Analytics. Data Science Journal, 13, pp. 106–118. DOI: http://doi.org/10.2481/dsj.14-032. This article suggests a meta-heuristic approach to feature selection in order to achieve the best possible accuracy. The authors argue that this approach outperforms conventional feature selection methods on larger datasets and gives better accuracy.

EXPLORATORY DATA ANALYSIS

Exploring the Data & Data Visualization
Looking at the data description and head(train), I created a type.csv file (read in below as the type data frame). It indicates that, out of the 80 variables ("Id" ignored), 30 are numeric, 43 are categorical (factor), 2 are ordinal categorical with rank-like values (which we will use as-is at first), and 5 are dates (month/year values).
I will drop the 5 date variables after creating 2 new numeric variables representing time in years (discussed further below).

# checking for number of NAs for each variable
x<-data.frame(miss_cnt=sapply(train, function(x) sum(is.na(x))))
#40% of the 1460 training rows is 584.
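# Quick look (an added sketch) at which variables have the most missing values
head(x[order(-x$miss_cnt), , drop = FALSE], 10)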

type<-read.csv("D:/Boston College/MS AE Courses/Data Analysis/type.csv",header=TRUE)

num_vars<-data.frame(var=subset(type,type$Column_Type=="Numeric" | 
                                     type$Column_Type=="Ordinal categorical")[,1])
cat_vars<-data.frame(var=subset(type,type$Column_Type=="Categorical")[,1])

for(i in 1:nrow(num_vars)){
  plot(eval(parse(text=paste("train$",num_vars[i,1]))), train$SalePrice/1000,
       main="Scatterplots", xlab=num_vars[i,1], ylab="SalePrice", pch=12)
}

#cor(train[,num_vars], use="complete.obs",method="kendall")

This gives a scatter plot of each numeric variable against the target variable (SalePrice), showing visually how each of them is related to it.

Identifying Significant Variables

Numerical Variables

library(lattice)
b<-grep("Bsmt", names(train), value = TRUE)
#b2<-setdiff(b,grep("Bsmt", rownames(x1), value = TRUE))

# scatterplot matrix 
splom(train[c(b,"SalePrice")], main="Correlation of Basement Features with SalePrice")

cor_mat<-NULL
# 1 is Saleprice variable
for(i in 2:nrow(num_vars)){
    cor_mat[i] = cor(train$SalePrice,eval(parse(text=paste("train$",num_vars[i,]))))
}
cor_file<-NULL
cor_file<-data.frame(cbind(var=as.matrix(num_vars),corr=cor_mat))
cor_file$corr<-as.numeric(as.character(cor_file$corr))
cor_file$select<-ifelse(((cor_file$corr >= 0.50) ),1,ifelse((cor_file$corr <= -0.5),1,0))

# Numerical Variables selected due to strong and significant relationship 
sum(cor_file$select,na.rm=TRUE)
## [1] 8
cor_file1<-subset(cor_file,cor_file$select == 1)[,1]
final_num_vars<-as.matrix(cor_file1)

8 out of 30 numeric variables have an absolute correlation with SalePrice of at least 0.5, so we start our simple model with these features. This is also shown by the darker shades of blue (positive) and red (negative) cells in the correlogram shown below.

library(corrplot)
## corrplot 0.84 loaded
corrplot(cor(train[,as.matrix(final_num_vars)]), type="lower", insig = "p-value", 
         tl.srt = 45,order = "hclust")

We can see clearly from the correlations and visualizations above that SalePrice is strongly correlated with the selected variables, namely GarageArea, the age of the house, FullBath, and so on.

library(corrgram)
corrgram(train[,final_num_vars], order=NULL, lower.panel=panel.shade, upper.panel=NULL,
         text.panel=panel.txt, main="Ames Housing Data")

We also see high multicollinearity between variables such as GarageCars (the number of cars the garage holds) and GarageArea, and between TotalBsmtSF and 1stFlrSF, which is quite intuitive.

We will be dropping the multicollinear variables in the process of making our robust model.
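One way to quantify this multicollinearity before dropping variables is the variance inflation factor. This is a minimal sketch, not part of the pipeline above; it assumes the car package is installed and uses a few of the correlated predictors purely as an illustration.

# Variance inflation factors for a few correlated predictors (illustrative only)
library(car)
vif_check <- lm(SalePrice ~ GarageCars + GarageArea + TotalBsmtSF + X1stFlrSF + GrLivArea,
                data = train)
vif(vif_check)   # values well above ~5 suggest problematic collinearity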

Categorical Variables

# RUNNING CHI SQUARE TEST TO FIND THE SIGNIFICANT VARIABLES 
chi_df<-NULL
for(i in 1:nrow(cat_vars)){
                c  = chisq.test(train[,"SalePrice"], train[,cat_vars[i,]])
    chi_df$pval[i] = c$p.value
}

chi_df1<-as.data.frame(cbind(var=cat_vars,pval=chi_df$pval))
chi_df1$select<-ifelse(chi_df$pval>=0.05,0,1)
colnames(chi_df1)[1]<-"var"
sum(chi_df1$select)
## [1] 22
final_cat_vars<-subset(chi_df1,select==1)["var"]

22 out of 44 categorical variables have a statistically significant relationship with SalePrice at the 5% level.
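Because SalePrice is continuous, a one-way ANOVA F-test is a common alternative screen for a categorical predictor. This is an aside rather than the method used above; a minimal sketch for a single variable:

# Alternative screen (sketch): one-way ANOVA of SalePrice on one categorical predictor
summary(aov(SalePrice ~ Neighborhood, data = train))  # a small p-value suggests keeping it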

DATA PREPROCESSING

Missing Value Treatment
As seen above, there are many missing values in the data. Of the possible missing value treatments, I chose to impute values rather than drop columns; only the PoolQC field, with more than 90% of its values missing, will be dropped. In train, the numeric fields LotFrontage and MasVnrArea are the two columns replaced by their overall means, while categorical fields are replaced by their mode (the most frequently occurring value).

The code snippet for the same is given below:

tr1<-train

# For numeric variables - LotFrontage ; MasVnrarea

# REPLACEMENT BY MEAN
LotFrontage.mean<-mean(tr1$LotFrontage,na.rm=TRUE)
MasVnrArea.mean<-mean(tr1$MasVnrArea,na.rm=TRUE)

tr1$LotFrontage<-ifelse(is.na(tr1$LotFrontage),LotFrontage.mean,tr1$LotFrontage)
tr1$MasVnrArea<-ifelse(is.na(tr1$MasVnrArea),MasVnrArea.mean,tr1$MasVnrArea)

# For categorical variables - BsmtExposure ; BsmtQual; Electrical.
# REPLACEMENT BY MODE
# Note: read.csv loads these columns as factors, so ifelse() below drops the factor
# attributes and returns the underlying integer level codes; this is why the dummy
# variable names created later carry numeric level codes.
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# x<-data.frame(miss_cnt=sapply(tr1, function(x) sum(is.na(x))))

tr1$BsmtExposure<-ifelse(is.na(tr1$BsmtExposure),getmode(tr1$BsmtExposure),tr1$BsmtExposure)
tr1$BsmtQual<-ifelse(is.na(tr1$BsmtQual),getmode(tr1$BsmtQual),tr1$BsmtQual)
tr1$Electrical<-ifelse(is.na(tr1$Electrical),getmode(tr1$Electrical),tr1$Electrical)

tr1$GarageFinish<-ifelse(is.na(tr1$GarageFinish),getmode(tr1$GarageFinish),tr1$GarageFinish)
tr1$GarageQual<-ifelse(is.na(tr1$GarageQual),getmode(tr1$GarageQual),tr1$GarageQual)
tr1$GarageCond<-ifelse(is.na(tr1$GarageCond),getmode(tr1$GarageCond),tr1$GarageCond)
tr1$BsmtCond<-ifelse(is.na(tr1$BsmtCond),getmode(tr1$BsmtCond),tr1$BsmtCond)
tr1$BsmtFinType1<-ifelse(is.na(tr1$BsmtFinType1),getmode(tr1$BsmtFinType1),tr1$BsmtFinType1)
tr1$MasVnrType<-ifelse(is.na(tr1$MasVnrType),getmode(tr1$MasVnrType),tr1$MasVnrType)

Applying the same missing value treatment to the test dataset so that the fitted models can later be applied to it.

te1<-test
LotFrontage.mean<-mean(te1$LotFrontage,na.rm=TRUE)
MasVnrArea.mean<-mean(te1$MasVnrArea,na.rm=TRUE)

te1$LotFrontage<-ifelse(is.na(te1$LotFrontage),LotFrontage.mean,te1$LotFrontage)
te1$MasVnrArea<-ifelse(is.na(te1$MasVnrArea),MasVnrArea.mean,te1$MasVnrArea)

# Only missing in Test
te1$TotalBsmtSF<-ifelse(is.na(te1$TotalBsmtSF),mean(te1$TotalBsmtSF,na.rm=TRUE),te1$TotalBsmtSF)
te1$GarageCars<-ifelse(is.na(te1$GarageCars),mean(te1$GarageCars,na.rm=TRUE),te1$GarageCars)
te1$GarageArea<-ifelse(is.na(te1$GarageArea),mean(te1$GarageArea,na.rm=TRUE),te1$GarageArea)


te1$BsmtExposure<-ifelse(is.na(te1$BsmtExposure),getmode(te1$BsmtExposure),te1$BsmtExposure)
te1$BsmtQual<-ifelse(is.na(te1$BsmtQual),getmode(te1$BsmtQual),te1$BsmtQual)
te1$Electrical<-ifelse(is.na(te1$Electrical),getmode(te1$Electrical),te1$Electrical)

te1$GarageFinish<-ifelse(is.na(te1$GarageFinish),getmode(te1$GarageFinish),te1$GarageFinish)
te1$GarageQual<-ifelse(is.na(te1$GarageQual),getmode(te1$GarageQual),te1$GarageQual)
te1$GarageCond<-ifelse(is.na(te1$GarageCond),getmode(te1$GarageCond),te1$GarageCond)
te1$BsmtCond<-ifelse(is.na(te1$BsmtCond),getmode(te1$BsmtCond),te1$BsmtCond)
te1$BsmtFinType1<-ifelse(is.na(te1$BsmtFinType1),getmode(te1$BsmtFinType1),te1$BsmtFinType1)
te1$MasVnrType<-ifelse(is.na(te1$MasVnrType),getmode(te1$MasVnrType),te1$MasVnrType)

# Only missing in Test
te1$KitchenQual<-ifelse(is.na(te1$KitchenQual),getmode(te1$KitchenQual),te1$KitchenQual)
te1$SaleType<-ifelse(is.na(te1$SaleType),getmode(te1$SaleType),te1$SaleType)

Creating New Variables

tr2<-tr1
tr2$Built_to_Sold_yrs=tr2$YrSold-tr2$YearBuilt
tr2$yrs_since_remod=2010-tr2$YearRemodAdd

te2<-te1
te2$Built_to_Sold_yrs=te2$YrSold-te2$YearBuilt
te2$yrs_since_remod=2010-te2$YearRemodAdd

Creating Dummy Variables
From the selected categorical variables, we will create the dummy variables.
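As an aside, here is a sketch of an alternative to the manual loop below: caret::dummyVars (assuming the caret package is installed) learns one encoding from the training data and applies the same columns to test, which would avoid the train/test alignment step handled manually further down. It is shown commented out and is not used in this report.

# Sketch only (not run): consistent one-hot encoding across train and test with caret
# library(caret)
# dv <- dummyVars(~ ., data = tr2[, as.character(final_cat_vars$var)])
# tr_dummies <- data.frame(predict(dv, newdata = tr2))
# te_dummies <- data.frame(predict(dv, newdata = te2))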

#For every unique value in the string column, create a new 1/0 column
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
final<-NULL
final<-rbind(as.matrix(final_cat_vars$var),as.matrix(cor_file1),"Built_to_Sold_yrs",
             "yrs_since_remod","SalePrice")

tr3<-tr2[,final]
te3<-te2[,rbind(as.matrix(final_cat_vars$var),as.matrix(cor_file1),"Built_to_Sold_yrs",
             "yrs_since_remod")]

#for (i in 1:nrow(final_cat_vars)){
#  print(final_cat_vars[i,])
#  print(unique(eval(parse(text=paste("tr3$",(final_cat_vars[i,]))))))
#}

list<-setdiff(final_cat_vars$var,
              c("Neighborhood","MSZoningClass","MSSubClass","PoolQC"))

library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:corrgram':
## 
##     baseball
for (i in 1:length(list)){

  temp<-count(eval(parse(text=paste("tr3$",(list[i])))))
  temp1<-count(eval(parse(text=paste("te3$",(list[i])))))

  message('For variable:' ,list[i])  
      
  for(j in 1:nrow(temp)){
    level=temp[j,1]
    tr3$dummy<-0
    tr3$dummy<-ifelse(eval(parse(text=paste("tr3$",list[i]))) == level, 1, 0)
    names(tr3)[names(tr3) == "dummy"] <- paste("dummy", list[i],level, sep = "_")
    
  # message('Train:' ,level)  
    
    level1=temp1[j,1]
    te3$dummy<-0
    te3$dummy<-ifelse(eval(parse(text=paste("te3$",list[i]))) == level1, 1, 0)
    names(te3)[names(te3) == "dummy"] <- paste("dummy", list[i],level1, sep = "_")
  
  # message('Test:' ,level1)  
  }
}
## For variable:LandSlope
## For variable:BldgType
## For variable:HouseStyle
## For variable:RoofMatl
## For variable:MasVnrType
## For variable:ExterCond
## For variable:Foundation
## For variable:BsmtCond
## For variable:BsmtExposure
## For variable:BsmtFinType1
## For variable:Heating
## For variable:HeatingQC
## For variable:KitchenQual
## For variable:Functional
## For variable:GarageFinish
## For variable:GarageQual
## For variable:GarageCond
## For variable:PavedDrive
## For variable:SaleType
te_not_train<-setdiff(colnames(te3),colnames(tr3))
tr_not_test<-setdiff(colnames(tr3),colnames(te3))

  # start at 2: the first element of tr_not_test is SalePrice, which test should not contain
  for(i in 2:length(tr_not_test)){
    te3$dummy<-0
    names(te3)[names(te3) == "dummy"] <- tr_not_test[i]
}

model_var_list<-setdiff(colnames(tr3),list)
model_var_list<-setdiff(model_var_list,c("SalePrice","Neighborhood","Pool_QC",
                                         grep("NA", colnames(tr3), value = TRUE)))


# Function for creating variable list for Random Forest Algorithm
RF_model_vars <- function(var) {
    paste(model_var_list, var, sep=" + ")
}

#Applying the same feature selection and manipulation to the "test" dataset

test_var_list<-setdiff(colnames(te3),list)
test_var_list<-setdiff(test_var_list,c("SalePrice","dummy_MasVnrType_NA",
                "dummy_BsmtCond_NA","dummy_BsmtExposure_NA","dummy_BsmtFinType1_NA",
                "dummy_GarageFinish_NA","dummy_GarageQual_NA","dummy_GarageCond_NA",
                "dummy_PoolQC_NA", "Neighborhood","Pool_QC",
                "dummy_HouseStyle_NA","dummy_RoofMatl_NA","dummy_Heating_NA"))

#library(mlr)
#t<-createDummyFeatures(train, cols = final_cat_vars[1:2,])

#Checking missing again.
x3<-data.frame(miss_cnt=sapply(tr3[model_var_list], function(x) sum(is.na(x))))
x4<-data.frame(miss_cnt=sapply(te3[test_var_list], function(x) sum(is.na(x))))

MODEL BUILDING

glm.simple <- lm(SalePrice ~ MSSubClass + OverallQual + TotalBsmtSF + X1stFlrSF + GrLivArea +
                   FullBath + TotRmsAbvGrd + GarageCars + GarageArea + dummy_LandSlope_Sev +
                   dummy_BldgType_TwnhsE + dummy_HouseStyle_SLvl + dummy_RoofMatl_WdShngl +
                   dummy_ExterCond_TA + dummy_Foundation_Wood + dummy_Heating_Wall +
                   dummy_HeatingQC_TA + dummy_KitchenQual_TA + dummy_Functional_Typ +
                   dummy_PavedDrive_Y + dummy_SaleType_WD,
                 data = tr3)

summary(glm.simple)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + OverallQual + TotalBsmtSF + 
##     X1stFlrSF + GrLivArea + FullBath + TotRmsAbvGrd + GarageCars + 
##     GarageArea + dummy_LandSlope_Sev + dummy_BldgType_TwnhsE + 
##     dummy_HouseStyle_SLvl + dummy_RoofMatl_WdShngl + dummy_ExterCond_TA + 
##     dummy_Foundation_Wood + dummy_Heating_Wall + dummy_HeatingQC_TA + 
##     dummy_KitchenQual_TA + dummy_Functional_Typ + dummy_PavedDrive_Y + 
##     dummy_SaleType_WD, data = tr3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -452588  -18569   -1481   16461  278643 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -72476.52    9396.75   -7.71  2.3e-14 ***
## MSSubClass               -158.91      32.77   -4.85  1.4e-06 ***
## OverallQual             20362.85    1182.25   17.22  < 2e-16 ***
## TotalBsmtSF                19.17       4.38    4.38  1.3e-05 ***
## X1stFlrSF                   7.34       5.31    1.38  0.16658    
## GrLivArea                  48.74       4.31   11.31  < 2e-16 ***
## FullBath                 1241.32    2530.30    0.49  0.62380    
## TotRmsAbvGrd             -210.95    1152.20   -0.18  0.85476    
## GarageCars              12980.96    3002.09    4.32  1.6e-05 ***
## GarageArea                  9.80      10.30    0.95  0.34149    
## dummy_LandSlope_Sev     33356.04   10549.05    3.16  0.00160 ** 
## dummy_BldgType_TwnhsE    3588.79    5000.05    0.72  0.47303    
## dummy_HouseStyle_SLvl    6660.17    5037.18    1.32  0.18631    
## dummy_RoofMatl_WdShngl  79239.40   15575.72    5.09  4.1e-07 ***
## dummy_ExterCond_TA       -181.16    3066.08   -0.06  0.95289    
## dummy_Foundation_Wood  -19857.40   21661.21   -0.92  0.35944    
## dummy_Heating_Wall      19125.02   19279.89    0.99  0.32138    
## dummy_HeatingQC_TA      -6957.83    2414.40   -2.88  0.00401 ** 
## dummy_KitchenQual_TA   -11310.92    2523.67   -4.48  8.0e-06 ***
## dummy_Functional_Typ    15338.97    4120.94    3.72  0.00021 ***
## dummy_PavedDrive_Y       9393.37    3802.26    2.47  0.01361 *  
## dummy_SaleType_WD      -10150.72    3020.44   -3.36  0.00080 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37200 on 1438 degrees of freedom
## Multiple R-squared:  0.783,  Adjusted R-squared:  0.78 
## F-statistic:  248 on 21 and 1438 DF,  p-value: <2e-16
#Residual standard error: 30460 on 1246 degrees of freedom
#  (119 observations deleted due to missingness)
#Multiple R-squared:  0.8614,   Adjusted R-squared:  0.8509 
#F-statistic: 82.35 on 94 and 1246 DF,  p-value: < 2.2e-16

The simple model comes out at an R-squared of about 0.78 (the commented-out figures above are from an earlier run with a larger set of predictors).

plot(glm.simple$fitted.values, glm.simple$residuals,
     xlab = "Fitted values", ylab = "Residuals")

Checking the constant-variance (homoscedasticity) assumption by plotting the residuals against the fitted values.
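Base R also provides the standard diagnostic plots (residuals vs fitted, normal Q-Q, scale-location, and leverage) in one call; a quick sketch:

# Standard lm diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(glm.simple)
par(mfrow = c(1, 1))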

x_train=data.matrix(tr3[,setdiff(model_var_list,"PoolQC")])
#x_train=data.matrix(tr3[,final_num_vars])
y_train=tr3$SalePrice
#x_test=data.matrix(subset(te3,select=setdiff(test_var_list,"PoolQC")))
x_test=data.matrix(subset(te3,select=setdiff(model_var_list,"PoolQC")))
#x_test=data.matrix(subset(te3,select=final_num_vars))

Performing advanced techniques to get more accurate results
We first perform lasso regression: with many candidate predictors, lasso and the other penalized regression models below shrink coefficients and help avoid overfitting.

pred_glm=predict(glm.simple,tr3)

library(mlr)
## Loading required package: ParamHelpers
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-13
# LASSO
cv=cv.glmnet(x_train,y_train,alpha=1)
penalty=cv$lambda.min
glm.lasso=glmnet(x=x_train,y=y_train,lambda = penalty)
pred_lasso=as.numeric(predict(glm.lasso,x_train))

# RIDGE
cv=cv.glmnet(x_train,y_train,alpha=0)
penalty=cv$lambda.min
glm.ridge=glmnet(x=x_train,y=y_train,lambda = penalty)
pred_ridge=as.numeric(predict(glm.ridge,x_train))
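# Added sketch: inspect the fitted penalized models. plot() on a cv.glmnet object shows the
# cross-validated error curve against log(lambda); counting zero coefficients shows how
# aggressively the lasso has pruned variables.
plot(cv)                                      # `cv` currently holds the ridge CV object
sum(as.vector(coef(glm.lasso)) == 0)          # number of coefficients set to zero by lasso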

# Random Forest  
#First_rf_vars_used: MSSubClass + OverallQual + TotalBsmtSF + X1stFlrSF + GrLivArea + FullBath + TotRmsAbvGrd + GarageCars + GarageArea + dummy_LandSlope_Sev + dummy_BldgType_TwnhsE + dummy_HouseStyle_SLvl + dummy_RoofMatl_WdShngl + dummy_ExterCond_TA + dummy_Foundation_Wood + dummy_Heating_Wall + dummy_HeatingQC_TA + dummy_Functional_Typ  + dummy_PavedDrive_Y 
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
rf_fit <- randomForest(as.numeric(y_train) ~ TotalBsmtSF + FullBath + GarageArea + dummy_LandSlope_Gtl + dummy_BldgType_1Fam + dummy_BldgType_Twnhs + dummy_HouseStyle_1.5Unf + dummy_HouseStyle_2.5Unf + dummy_HouseStyle_SLvl + dummy_RoofMatl_Membran + dummy_MasVnrType_1 + dummy_MasVnrType_4 + dummy_ExterCond_Gd + dummy_Foundation_BrkTil + dummy_Foundation_Slab + dummy_BsmtCond_1 + dummy_BsmtCond_4 + dummy_BsmtExposure_3 + dummy_BsmtFinType1_2 + dummy_BsmtFinType1_5 + dummy_Heating_GasA + dummy_Heating_OthW + dummy_HeatingQC_Fa + dummy_HeatingQC_TA + dummy_KitchenQual_Gd + dummy_Functional_Maj2 + dummy_Functional_Mod + dummy_GarageFinish_1 + dummy_GarageQual_1 + dummy_GarageQual_4 + dummy_GarageCond_2 + dummy_GarageCond_5 + dummy_PavedDrive_Y + dummy_SaleType_ConLD + dummy_SaleType_CWD + dummy_SaleType_WD + X1stFlrSF + TotRmsAbvGrd + Built_to_Sold_yrs + dummy_LandSlope_Mod + dummy_BldgType_2fmCon + dummy_BldgType_TwnhsE + dummy_HouseStyle_1Story + dummy_HouseStyle_2Story + dummy_RoofMatl_ClyTile + dummy_RoofMatl_Metal + dummy_RoofMatl_WdShake + dummy_MasVnrType_2 + dummy_ExterCond_Ex + dummy_ExterCond_Po + dummy_Foundation_CBlock + dummy_Foundation_Stone + dummy_BsmtCond_2 + dummy_BsmtExposure_1 + dummy_BsmtExposure_4 + dummy_BsmtFinType1_3 + dummy_BsmtFinType1_6 + dummy_Heating_GasW + dummy_Heating_Wall + dummy_HeatingQC_Gd + dummy_KitchenQual_Ex + dummy_KitchenQual_TA + dummy_Functional_Min1 + dummy_Functional_Sev + dummy_GarageFinish_2 + dummy_GarageQual_2 + dummy_GarageQual_5 + dummy_GarageCond_3 + dummy_PavedDrive_N + dummy_SaleType_COD + dummy_SaleType_ConLI + dummy_SaleType_New + OverallQual + GrLivArea + GarageCars + yrs_since_remod + dummy_LandSlope_Sev + dummy_BldgType_Duplex + dummy_HouseStyle_1.5Fin + dummy_HouseStyle_2.5Fin + dummy_HouseStyle_SFoyer + dummy_RoofMatl_CompShg + dummy_RoofMatl_Roll + dummy_RoofMatl_WdShngl + dummy_MasVnrType_3 + dummy_ExterCond_Fa + dummy_ExterCond_TA + dummy_Foundation_PConc + dummy_Foundation_Wood + dummy_BsmtCond_3 + dummy_BsmtExposure_2 + dummy_BsmtFinType1_1 + dummy_BsmtFinType1_4 + dummy_Heating_Floor + dummy_Heating_Grav + dummy_HeatingQC_Ex + dummy_HeatingQC_Po + dummy_KitchenQual_Fa + dummy_Functional_Maj1 + dummy_Functional_Min2 + dummy_Functional_Typ + dummy_GarageFinish_3 + dummy_GarageQual_3 + dummy_GarageCond_1 + dummy_GarageCond_4 + dummy_PavedDrive_P + dummy_SaleType_Con + dummy_SaleType_ConLw + dummy_SaleType_Oth 
                    , data=x_train, importance=TRUE, ntree=2000)
pred_rf=as.numeric(predict(rf_fit,x_train))
#dummy_KitchenQual_TA column removed temporarily

# Function for Root Mean Squared Error
RMSE <- function(error) { sqrt(mean(error^2)) }

error_glm  =pred_glm-y_train
error_lasso=pred_lasso-y_train
error_ridge=pred_ridge-y_train
error_rf   =pred_rf-y_train
paste("RMSE with glm regression:", RMSE(error_glm))
## [1] "RMSE with glm regression: 36960.6062816708"
paste("RMSE with lasso regression:", RMSE(error_lasso))
## [1] "RMSE with lasso regression: 29362.5727372719"
paste("RMSE with ridge regression:", RMSE(error_ridge))
## [1] "RMSE with ridge regression: 35425.4336678416"
paste("RMSE with Random Forest:", RMSE(error_rf))
## [1] "RMSE with Random Forest: 12787.5183819935"

The RMSE on the training data comes out lowest for the random forest algorithm. In the next section we make the final predictions with this algorithm.
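These RMSE figures are computed on the training data itself, which flatters flexible models such as the random forest. A simple holdout split gives a fairer sense of out-of-sample error; this is only a sketch using base R, and the 80/20 split and the small illustrative formula are my own assumptions rather than part of the pipeline above.

# Sketch: 80/20 holdout check for out-of-sample error (illustrative only)
set.seed(123)
idx      <- sample(nrow(tr3), floor(0.8 * nrow(tr3)))
fit_hold <- lm(SalePrice ~ OverallQual + GrLivArea + GarageCars + TotalBsmtSF,
               data = tr3[idx, ])
hold_err <- predict(fit_hold, tr3[-idx, ]) - tr3$SalePrice[-idx]
paste("Holdout RMSE (simple lm):", RMSE(hold_err))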

MODEL PREDICTION

#subset(te1,select=test_var_list)

# GLM
#predict_glm  =as.numeric(predict(glm.simple,te3))

# Lasso Regression
#test_var_list1<-intersect(model_var_list,test_var_list)
new_test=te3[,setdiff(model_var_list,"PoolQC")]
#new_test=te3[,final_num_vars]
#x_test$SalePrice<-NULL    # not needed: x_test is a matrix built without SalePrice
predict_lasso=as.numeric(predict(glm.lasso,data.matrix(new_test),type='response'))

# Ridge Regression
predict_ridge=as.numeric(predict(glm.ridge,data.matrix(new_test)))

# Random Forest
te4<-te3
te4$SalePrice<-0
predict_rf=as.numeric(predict(rf_fit,te4))

#write.csv(predict_glm,"D:/Boston College/MS AE Courses/Data Analysis/predict_glm.csv")
write.csv(predict_lasso, "D:/Boston College/MS AE Courses/Data Analysis/predict_lasso.csv")
write.csv(predict_ridge, "D:/Boston College/MS AE Courses/Data Analysis/predict_ridge.csv")
write.csv(predict_rf,"D:/Boston College/MS AE Courses/Data Analysis/predict_rf.csv")
#write.csv(predict_xgb, "D:/Boston College/MS AE Courses/Data Analysis/predict_xgb.csv")

I have been able to get an error of 0.18 on the Kaggle platform using ridge regression (compared with 0.62, 0.43, and 0.21 in prior attempts). The result can be viewed on my profile here - priyanka gagneja. I continue to work on it, aiming to improve it further.
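For context, the Kaggle leaderboard score for this competition is the RMSE between the logarithm of the predicted price and the logarithm of the observed price, so the 0.18 figure is on the log scale. Below is a sketch of that metric computed on the training predictions; it is only a rough in-sample analogue of the leaderboard number, and the pmax guard against non-positive predictions is my own assumption.

# Kaggle-style metric (sketch): RMSE of log(prediction) vs log(actual), on training data
rmsle <- function(pred, actual) sqrt(mean((log(pmax(pred, 1)) - log(actual))^2))
rmsle(pred_ridge, y_train)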

LEARNINGS
I have learned to implement advanced regression techniques in my attempt to predict the sale price of Ames housing accurately, including the parameters these models take and how to tune them.
I have become somewhat better at feature selection and want to improve further as I continue to work on this problem beyond this submission.

LIMITATIONS
Due to the limited amount of time and exposure so far, we have only been able to attempt some of the possible data manipulation and a handful of model classes.

FUTURE WORK
As next steps, I want to implement more advanced neural network and gradient boosting algorithms, along with more sophisticated feature selection and manipulation. I will keep working on this beyond the submission of this project.