Background

In order to put newly acquired data science skills to test, l decided to explore the well-known dataset Ames Housing dataset. The Ames dataset used in this instance documents 2006-2010 data records describing the sale of individual residential property in Ames, Iowa. The Ames Housing Dataset was complied by Dean De Cock for use in data science education and publicly available as a competition hosted on Kaggle.com,in which the goal is to build a model to predict sale prices of houses sold in Ames, Iowa . With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this dataset challenges me to predict the final price of each home.The data set contains 2930 observations and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous) involved in assessing home values.

Kaggle provides 2 datafiles, one is a training set containing all 81 variables, and the other is a test set, which is missing the SalePrice column.

Required Libraries

Below are required libraries that are employed at various stages throughout the modelling process and also to possibly allow for the running of this notebook.

library('ggplot2')
library('ggthemes')
library('scales')
library('dplyr')
library('mice')
library('randomForest')
library('data.table')
library('gridExtra')
library('corrplot')
library('GGally')
library('e1071')
library('reshape2')
library('lares')
library('gbm')
library('MASS')
library('caret')
library('skimr')

Loading of Dataset

The two data files saved on my computer are loaded into R for the purpose of further analysis.

train = read.csv("train_reg_features-1.csv")
#Train data dimension
dim(train)
[1] 1460   81
test = read.csv("test_reg_features-1.csv")
#test data dimension
dim(test)
[1] 1459   80

Exploratoratory Data Analysis

For the purpose of exploratory analysis , the test and train data set are combined for the pupose of cleaning and analysis. The however allows me to preview the entire(combined) dataset.The combined dataset will be later separated to prevent data leakage and repetition during modelling. The combined datasset contains 2919 observations coupled with 81 variables. The training and testing data reports of 1460 and 1459 observations respectively.

test$SalePrice <-NA 

#Bind the train and test datasets for EDA/cleaning
combine_data <-rbind(train,test)
str(combine_data)
'data.frame':   2919 obs. of  81 variables:
 $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
 $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
 $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
 $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
 $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
 $ Alley        : chr  NA NA NA NA ...
 $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
 $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
 $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
 $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
 $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
 $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
 $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
 $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
 $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
 $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
 $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
 $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
 $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
 $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
 $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
 $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
 $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
 $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
 $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
 $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
 $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
 $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
 $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
 $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
 $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
 $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
 $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
 $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
 $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
 $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
 $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
 $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
 $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
 $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
 $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
 $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
 $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
 $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
 $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
 $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
 $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
 $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
 $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
 $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
 $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
 $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
 $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
 $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
 $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
 $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
 $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
 $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
 $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
 $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
 $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
 $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
 $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
 $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
 $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
 $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
 $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
 $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
 $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolQC       : chr  NA NA NA NA ...
 $ Fence        : chr  NA NA NA NA ...
 $ MiscFeature  : chr  NA NA NA NA ...
 $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
 $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
 $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
 $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
 $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
 $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

From the above it can be observed that all variables are either class character or integer, as such most of them needs to be recoded as numeric or factor. The project documentation suggest that there are 23 ordinal, 23 nominal, 14 discrete and 20 continuous variables.

table(sapply(combine_data, class))

character   integer 
       43        38 

Target Variable Distribution

Below is a visualization of the target variable, which is demonstrated using a histogram and a density plot.The initial plot of the target variable suggest that it is not normally distributed and hence needs to be corrected.The skewness and kurtosis as reported are 1.879009 and 6.4496789 respectively.

## Summary of Target Variable
summary(combine_data$SalePrice)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  34900  129975  163000  180921  214000  755000    1459 
##checking skewness and kurtosis
skewness(train$SalePrice)
[1] 1.879009
kurtosis(train$SalePrice)
[1] 6.496789
### Visual view of target variable
ggplot(combine_data, mapping = aes(x = SalePrice)) + geom_histogram(color = "black", fill = "lightblue")

Normalization of target variable

Target variable(SalePrice) after performing log transformation, looks much more normal.

combine_data$log_SalePrice <-log(combine_data$SalePrice)

ggplot(data = combine_data, aes(x = log_SalePrice)) +
  geom_histogram(col = "#E1AF00", fill = "#3B9AB2", bins = 50)  + 
  scale_x_continuous(breaks = seq(0, 800000, by = 100000), labels = comma) +
  labs(title = "Housing Data: Log(Sale Price)", x = "Log(Sale Price)")

Checking categorical columns

Below is a check on categorical columns in the combined dataset.


names(combine_data)[sapply(combine_data, typeof) == "character"]
 [1] "MSZoning"      "Street"        "Alley"         "LotShape"      "LandContour"   "Utilities"     "LotConfig"     "LandSlope"     "Neighborhood" 
[10] "Condition1"    "Condition2"    "BldgType"      "HouseStyle"    "RoofStyle"     "RoofMatl"      "Exterior1st"   "Exterior2nd"   "MasVnrType"   
[19] "ExterQual"     "ExterCond"     "Foundation"    "BsmtQual"      "BsmtCond"      "BsmtExposure"  "BsmtFinType1"  "BsmtFinType2"  "Heating"      
[28] "HeatingQC"     "CentralAir"    "Electrical"    "KitchenQual"   "Functional"    "FireplaceQu"   "GarageType"    "GarageFinish"  "GarageQual"   
[37] "GarageCond"    "PavedDrive"    "PoolQC"        "Fence"         "MiscFeature"   "SaleType"      "SaleCondition"
sum(sapply(combine_data[,], typeof) == "character")
[1] 43

Checking numerical columns

Below is a view of all numerical columns found in combined dataset.

names(combine_data)[unlist(lapply(combine_data, is.numeric))]
 [1] "Id"            "MSSubClass"    "LotFrontage"   "LotArea"       "OverallQual"   "OverallCond"   "YearBuilt"     "YearRemodAdd"  "MasVnrArea"   
[10] "BsmtFinSF1"    "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"   "X1stFlrSF"     "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"     "BsmtFullBath" 
[19] "BsmtHalfBath"  "FullBath"      "HalfBath"      "BedroomAbvGr"  "KitchenAbvGr"  "TotRmsAbvGrd"  "Fireplaces"    "GarageYrBlt"   "GarageCars"   
[28] "GarageArea"    "WoodDeckSF"    "OpenPorchSF"   "EnclosedPorch" "X3SsnPorch"    "ScreenPorch"   "PoolArea"      "MiscVal"       "MoSold"       
[37] "YrSold"        "SalePrice"     "log_SalePrice"
sum(sapply(combine_data[,], typeof) != "character")
[1] 39

Distribution Plot of Numerical Variables

#Skewed
#continuous
p1 <-ggplot(data = combine_data, aes(x = LotFrontage)) +
  geom_histogram(fill = "#3B9AB2", bins = 50)  
 

#Definitely needs to be transformed (prob on a log scale)
#continuous
p2 <-ggplot(data = combine_data, aes(x = LotArea)) +
  geom_histogram(fill = "#3B9AB2", bins = 50)  
  
#ordinal
p3<-ggplot(data = combine_data, aes(x = OverallQual)) +
  geom_bar(fill = "#3B9AB2")  


#ordinal
p4<-ggplot(data = combine_data, aes(x = OverallCond)) +
  geom_bar(fill = "#3B9AB2")  


#YearBuilt- negatively skewed, more newer houses (DISCRETE BUT STILL...)
p5<-ggplot(data =combine_data, aes(x = YearBuilt)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#interesting distribution- something weird going on with the 0s here (meaning not remodeled)
#DISCRETE
p6<-ggplot(data = combine_data, aes(x = YearRemodAdd)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#Obviously missing data issue
#continuous
p7<-ggplot(data = combine_data, aes(x = MasVnrArea)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 
 

#missing data (0 = no finished basement)
#continuous
p8<-ggplot(data =combine_data, aes(x = BsmtFinSF1)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 
  

#same deal w/missing, would need to leave out 0s to see the real dist. here. 
#or do a log transform or whatever. 
#continuous
p9<-ggplot(data = combine_data, aes(x = BsmtFinSF2)) +
  geom_histogram(fill = "#3B9AB2", bins = 30) 


#continuous
p10<-ggplot(data = combine_data, aes(x = BsmtUnfSF)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 
 
#continuous
p11<-ggplot(data = combine_data, aes(x = TotalBsmtSF)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#continuous
p12<-ggplot(data = combine_data, aes(x = X1stFlrSF)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#LOTS of missing values here (aka lots of single-story homes)
#continuous
p13<-ggplot(data = combine_data, aes(x = X2ndFlrSF)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#pointless- on a log scale, or too much missing data
#maybe could set these to not = 0?
#continuous
p14 <-ggplot(data = combine_data, aes(x = LowQualFinSF)) +
  geom_histogram(fill = "#3B9AB2", bins = 10) 


#skewed
#continuous
p15 <-ggplot(data = combine_data, aes(x = GrLivArea)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#discrete
p16<-ggplot(data =combine_data, aes(x = BsmtFullBath)) +
  geom_bar(fill = "#3B9AB2") 


#discrete
p17<-ggplot(data = combine_data, aes(x = BsmtHalfBath)) +
  geom_bar(fill = "#3B9AB2") 


#discrete
p18 <-ggplot(data = combine_data, aes(x = FullBath )) +
  geom_bar(fill = "#3B9AB2") 


#discrete
p19 <-ggplot(data = combine_data, aes(x =HalfBath )) +
  geom_bar(fill = "#3B9AB2") 
 

#discrete
p20 <-ggplot(data = combine_data, aes(x =BedroomAbvGr )) +
  geom_bar(fill = "#3B9AB2") 


#discrete
p21 <-ggplot(data = combine_data, aes(x =KitchenAbvGr )) +
  geom_bar(fill = "#3B9AB2") 
 

#TOTRMSABVGRD
#(discrete)
p22<-ggplot(data = combine_data, aes(x = TotRmsAbvGrd)) +
  geom_bar(fill = "#3B9AB2") 



#FIREPLACES
#(discrete)
p23<-ggplot(data = combine_data, aes(x = Fireplaces)) +
  geom_bar(fill = "#3B9AB2") 


#skewed
#DISCRETE BUT STILL...
p24 <-ggplot(data = combine_data, aes(x = GarageYrBlt)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) + 
  scale_x_continuous(limits = c(1900, 2020))


#(discrete) (this is absurd, clearly should be ordinal??)
p25<-ggplot(data = combine_data, aes(x = GarageCars)) +
  geom_bar(fill = "#3B9AB2") 
 

#0 = no garage
#CONTINUOUS
p26 <-ggplot(data = combine_data, aes(x = GarageArea)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 
 

#continuous
p27 <-ggplot(data = combine_data, aes(x = WoodDeckSF)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 
 

#continuous
p28 <-ggplot(data = combine_data, aes(x = OpenPorchSF)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 

 
#continuous
p29 <-ggplot(data = combine_data, aes(x = EnclosedPorch)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 
 

#continuous
p30 <-ggplot(data = combine_data, aes(x = X3SsnPorch)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#continuous
p31 <-ggplot(data = combine_data, aes(x = ScreenPorch)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#Continuous
p32 <-ggplot(data = combine_data, aes(x = PoolArea)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 
  

#continuous
p33 <-ggplot(data = combine_data, aes(x = MiscVal)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#MO SOLD?? (DISCRETE) (RECODE)
p34 <-ggplot(data = combine_data, aes(x = MoSold)) +
  geom_bar(fill = "#3B9AB2") 


#YR SOLD (discrete)
p35 <-ggplot(data =combine_data, aes(x = YrSold)) +
  geom_bar(fill = "#3B9AB2") 

grid.arrange(p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,
             p12,p13,p14,p15,p16,p17,p18,p19,p20,ncol = 5)

grid.arrange(p21,p22,p23,p24,p25,p26,p27,p28,p29,p30,
             p31,p32,p33,p34,p35,ncol = 4)

From the above plots, it can be noticed that there many numeric variables that are not normally distributed coupled with other variables such as LotFrontage, MasVnrArea and other variables appearing to be skewed due to the presence of outliers.

Distribution Plot of Categorical Variables


b1<-ggplot(data = combine_data, aes(x = MSZoning)) +
  geom_bar(fill = "#3B9AB2") 
 

b2<-ggplot(data = combine_data, aes(x = Street)) +
  geom_bar(fill = "#3B9AB2") 


b3<-ggplot(data = combine_data, aes(x = Alley)) +
  geom_bar(fill = "#3B9AB2") 


b4<-ggplot(data = combine_data, aes(x = LotShape)) +
  geom_bar(fill = "#3B9AB2")  


b5 <-ggplot(data = combine_data, aes(x = LandContour)) +
  geom_bar(fill = "#3B9AB2") 



b6 <-ggplot(data = combine_data, aes(x = Utilities)) +
  geom_bar(fill = "#3B9AB2") #+ 



b7 <-ggplot(data = combine_data, aes(x = LotConfig)) +
  geom_bar(fill = "#3B9AB2") 



b8 <-ggplot(data = combine_data, aes(x = LandSlope)) +
  geom_bar(fill = "#3B9AB2") 



b9<-ggplot(data = combine_data, aes(x = Neighborhood)) +
  geom_bar(fill = "#3B9AB2") + coord_flip() 



b10 <-ggplot(data = combine_data, aes(x = Condition1)) +
  geom_bar(fill = "#3B9AB2") + coord_flip() 


b11 <-ggplot(data = combine_data, aes(x = Condition2)) +
  geom_bar(fill = "#3B9AB2") + coord_flip() 



b12 <-ggplot(data = combine_data, aes(x = BldgType)) +
  geom_bar(fill = "#3B9AB2") 


b13 <-ggplot(data = combine_data, aes(x = HouseStyle)) +
  geom_bar(fill = "#3B9AB2") 



b14 <-ggplot(data = combine_data, aes(x = RoofStyle)) +
  geom_bar(fill = "#3B9AB2") 



b15<-ggplot(data = combine_data, aes(x = RoofMatl)) +
  geom_bar(fill = "#3B9AB2") + coord_flip() 
 


b16 <-ggplot(data = combine_data, aes(x = Exterior1st)) +
  geom_bar(fill = "#3B9AB2") + coord_flip()
  


b17 <-ggplot(data = combine_data, aes(x = Exterior2nd)) +
  geom_bar(fill = "#3B9AB2") + coord_flip() 
 


b18 <-ggplot(data = combine_data, aes(x = MasVnrType)) +
  geom_bar(fill = "#3B9AB2") 



b19 <-ggplot(data = combine_data, aes(x = ExterQual)) +
  geom_bar(fill = "#3B9AB2") 



b20 <-ggplot(data = combine_data, aes(x = ExterCond)) +
  geom_bar(fill = "#3B9AB2") 



b21 <-ggplot(data = combine_data, aes(x = Foundation)) +
  geom_bar(fill = "#3B9AB2") 


b22 <-ggplot(data = combine_data, aes(x = BsmtQual)) +
  geom_bar(fill = "#3B9AB2") 

b23 <-ggplot(data = combine_data, aes(x = BsmtCond)) +
  geom_bar(fill = "#3B9AB2") 
 


b24 <-ggplot(data = combine_data, aes(x = BsmtExposure)) +
  geom_bar(fill = "#3B9AB2") 


b25 <-ggplot(data = combine_data, aes(x = BsmtFinType1)) +
  geom_bar(fill = "#3B9AB2") 
 


b26 <-ggplot(data = combine_data, aes(x = BsmtFinType2)) +
  geom_bar(fill = "#3B9AB2") 
  

b27 <-ggplot(data = combine_data, aes(x = Heating)) +
  geom_bar(fill = "#3B9AB2") 



b28 <-ggplot(data = combine_data, aes(x = HeatingQC)) +
  geom_bar(fill = "#3B9AB2") 


b29 <-ggplot(data =combine_data, aes(x = CentralAir)) +
  geom_bar(fill = "#3B9AB2") 
  


b30 <-ggplot(data = combine_data, aes(x = Electrical)) +
  geom_bar(fill = "#3B9AB2")  


b31 <-ggplot(data = combine_data, aes(x = KitchenQual)) +
  geom_bar(fill = "#3B9AB2") 



b32<-ggplot(data = combine_data, aes(x = Functional)) +
  geom_bar(fill = "#3B9AB2") 



b33<-ggplot(data = combine_data, aes(x = FireplaceQu)) +
  geom_bar(fill = "#3B9AB2") 


b34<-ggplot(data = combine_data, aes(x = GarageType)) +
  geom_bar(fill = "#3B9AB2") 


b35 <-ggplot(data = combine_data, aes(x = GarageFinish)) +
  geom_bar(fill = "#3B9AB2") 



b36 <-ggplot(data = combine_data, aes(x = GarageQual)) +
  geom_bar(fill = "#3B9AB2") 
grid.arrange(b1,b2,b3,b4,b5,b6,b7,b8,b9,b10,ncol = 4)

grid.arrange(b11,b12,b13,b14,b15,b16,b17,b18,b19,b20,ncol = 4)

grid.arrange(b21,b22,b23,b24,b25,b26,b27,b28,b29,b30,b31,b32,b33,b34,b35,b36,ncol = 4)

From the above some categorical variables display evidence of non-normality and (multi)collineraity whiles others variables may be useful features in predicting the target variable. It can also be noted that are significant amount of missing data values.Many of these NA observations are likely to be structurally missing for a logical reason. For instance GarageArea= NA, suggest that the property has no garage. As such missing data will require imputation.

Splitting Data into Numerical and Categorical Variables

For the purpose of visualization, the combined dataset is split into numerical and categorical variables. Also one train dataset with categorical and numeric variable is created

cat_var <- names(combine_data)[which(sapply(combine_data, is.character))]
numeric_var <- names(combine_data)[which(sapply(combine_data, is.numeric))]
train1_cat<-combine_data[cat_var]
train1_num<-combine_data[numeric_var]

Bivariate Plots(numeric)

This section looks at each variable and how they relate to the target variable(salePrice)

# Overall Quality vs Sale Price
# unique(train$OverallQual)
ggplot(train, mapping = aes(x = factor(OverallQual), y = SalePrice)) + geom_boxplot(col = "#3B9AB2")

From the above plot, it can be suggested that people are willing to pay more for better quality.

# Living Area vs Sale Price
ggplot(train, mapping = aes(x = GrLivArea, y = SalePrice)) + geom_point() + geom_smooth(method = "lm",col = "#3B9AB2")

From the above it can be suggested that people are would pay more for living area. However its not reasonable for people to pay less for large area as two data point in the bottom-right of the plot. These two points however needs to be removed.

# Removing outliers manually (The two points in the bottom right)
train = train[train$GrLivArea<=4500,]
ggplot(train, mapping = aes(x = GrLivArea, y = SalePrice)) + geom_point() + geom_smooth(method = "lm",col = "#3B9AB2")

# Garage Area vs Sale Price
unique(train$GarageCars)
[1] 2 3 1 0 4
ggplot(train, mapping = aes(x = factor(GarageCars), y = SalePrice)) + geom_boxplot(col = "#3B9AB2")

The relationship between garagecars and saleprice suggest that 4-car garages result in less sale Price which might be as a result of outliers. These outliers must be removed.

train = train[train$GarageCars < 4, ]
ggplot(train, mapping = aes(x = factor(GarageCars), y = SalePrice)) + geom_boxplot(col = "#3B9AB2")

# Garage Area vs Sale Price
ggplot(train, mapping = aes(x = GarageArea , y = SalePrice)) + geom_point() + geom_smooth(method = "lm",col = "#3B9AB2")

##Again, the two data points at the bottom does not make sense
train = filter(train, GarageArea < 1240)
ggplot(train, mapping = aes(x = GarageArea , y = SalePrice)) + geom_point() + geom_smooth(method = "lm",col = "#3B9AB2")

# Basement Area vs Sale Price
ggplot(train, mapping = aes(x = TotalBsmtSF , y = SalePrice)) + geom_point() + geom_smooth(method = "lm",col = "#3B9AB2")

# First Floor Area vs Sale Price
ggplot(train, mapping = aes(x = X1stFlrSF, y = SalePrice)) + geom_point() + geom_smooth(method = "lm",col = "#3B9AB2")

# # ExterQual vs Sale Price
ggplot(train, mapping = aes(x = factor(ExterQual), y = SalePrice)) + geom_boxplot(col = "#3B9AB2")

The above plot suggest that, houses with excellent external quality are expensive whereas houses with fair quality material are less expensive.

# ExterQual vs Sale Price
ggplot(train, mapping = aes(x = factor(FullBath), y = SalePrice)) + geom_boxplot(col = "#3B9AB2")

From the above plot it is not reasonable that houses with zero full bath are expensive than houses with one full bath.This will be an outlier and must be removed from the data set.

train = train[train$FullBath > 0 , ]
ggplot(train, mapping = aes(x = factor(FullBath), y = SalePrice)) + geom_boxplot(col = "#3B9AB2")

#BsmtQual and SalePrice
ggplot(train, mapping = aes(x = factor(BsmtQual), y = SalePrice)) + geom_boxplot(col = "#3B9AB2")

# Total Rooms vs Sale Price
ggplot(train, mapping = aes(x = factor(TotRmsAbvGrd), y = SalePrice)) + geom_boxplot(col = "#3B9AB2")+ scale_y_discrete()

Correlation Matrix

correlations <- cor(na.omit(train1_num[,-1]))
head(correlations, 2)
            MSSubClass LotFrontage    LotArea OverallQual OverallCond  YearBuilt YearRemodAdd MasVnrArea  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF TotalBsmtSF
MSSubClass   1.0000000  -0.3869396 -0.1980955  0.02952186 -0.08785932 0.02579968  0.006645194  0.0402400 -0.07038869 -0.07543900 -0.1455823  -0.2477812
LotFrontage -0.3869396   1.0000000  0.4211841  0.24132232 -0.04631165 0.10972557  0.086413968  0.1899686  0.24135223  0.04930532  0.1153059   0.3876195
             X1stFlrSF X2ndFlrSF LowQualFinSF  GrLivArea BsmtFullBath BsmtHalfBath  FullBath   HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd  Fireplaces
MSSubClass  -0.2522489 0.3193276   0.02470365 0.08336538   -0.0146813 0.0123095694 0.1312780 0.20397061  -0.03297115  0.266012356   0.04720943 -0.03112223
LotFrontage  0.4510850 0.0750038   0.01114809 0.39630602    0.1180881 0.0004335725 0.1857853 0.04567835   0.27040389 -0.003546472   0.34842111  0.26032083
            GarageYrBlt  GarageCars  GarageArea  WoodDeckSF OpenPorchSF EnclosedPorch  X3SsnPorch ScreenPorch    PoolArea      MiscVal      MoSold
MSSubClass   0.05470137 -0.02741067 -0.09260726 -0.01798842  0.00405397   -0.01778990 -0.03973896 -0.02178934 0.003166468 -0.040688673 -0.02717038
LotFrontage  0.06987812  0.28658681  0.35685094  0.08216563  0.16181512    0.01426101  0.06971577  0.03590598 0.211746117  0.001470738  0.01881453
                 YrSold  SalePrice log_SalePrice
MSSubClass  -0.01244783 -0.0880317   -0.08376839
LotFrontage  0.01326707  0.3442698    0.35361138
melted_correlations = melt(correlations)
head(melted_correlations, 2)


ggplot(data = melted_correlations, aes(x=Var1, y=Var2, fill=value)) +
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1,1), space = "Lab",
                       name="Pearson\nCorrelation") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.4,
                                   size = 12, hjust = 0.2))+ theme(aspect.ratio = 1) +
  coord_fixed()

The above correlation plot is too clustered, hence lets consider the top 10 correlated variables.

Top 10 correlated variables

corr_cross(train, # name of dataset
           max_pvalue = 0.05, # display only significant correlations (at 5% level)
           top = 10 # display top 10 couples of variables (by correlation coefficient)
)
Returning only the top 10. You may override with the 'top' argument

Top 10 correlated variables with Sale Price

top_10 = corr_var(train, # name of dataset
                  SalePrice, # name of variable to focus on
                  top = 10 # display top 5 correlations
)
top_10

From the above, Blue indicate positive correlation and red indicate negative correlations. ExterQual_TA from the plot negatively correlates with the target variable.

Data Cleaning

Before performing modeling, all variables with missing data are identified and corrected.

test$SalePrice <- NA
train$isTrain <- 1
test$isTrain <- 0
combine_data <-rbind(train,test)
head(combine_data,2)



Missing_indices <- sapply(combine_data,function(x) sum(is.na(x)))/nrow(combine_data)

Missing_Summary <- data.frame(index = names(combine_data),Missing_Values=Missing_indices)

Missing_Summary[order(Missing_Summary$Missing_Values > 0, decreasing = TRUE),]
### percentage of missing cases
sum(is.na(combine_data))/ (dim(combine_data)[1] *dim(combine_data)[2]) *100 
[1] 6.448575
all_data_na = Missing_Summary[order(Missing_Summary$Missing_Values, decreasing = TRUE), ]
all_data_na[1:5, ]

all_data <- combine_data

Inputing Missing Values

From the above variables with missing values are fixed since most of theme are actually not missing.

#Getmode

getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
#PoolQC
#Changing NA in PoolQC to None
all_data$PoolQC1 <- as.character(all_data$PoolQC)
all_data$PoolQC1[which(is.na(all_data$PoolQC))] <- "None"
all_data$PoolQC <- as.factor(all_data$PoolQC1)
all_data <- subset(all_data,select = -PoolQC1)
sum(is.na(all_data$PoolQC))
[1] 0
#MiscFeature
all_data$MiscFeature1 <- as.character(all_data$MiscFeature)
all_data$MiscFeature1[which(is.na(all_data$MiscFeature))] <- "None"
all_data$MiscFeature <- as.factor(all_data$MiscFeature1)
all_data <- subset(all_data,select = -MiscFeature1)
sum(is.na(all_data$MiscFeature))
[1] 0
#Alley
all_data$Alley1 <- as.character(all_data$Alley)
all_data$Alley1[which(is.na(all_data$Alley))] <- "None"
all_data$Alley <- as.factor(all_data$Alley1)
all_data <- subset(all_data,select = -Alley1)
sum(is.na(all_data$Alley))
[1] 0
#Fence
all_data$Fence1 <- as.character(all_data$Fence)
all_data$Fence1[which(is.na(all_data$Fence))] <- "None"
all_data$Fence <- as.factor(all_data$Fence1)
all_data <- subset(all_data,select = -Fence1)
sum(is.na(all_data$Fence))
[1] 0
#FireplaceQu
all_data$FireplaceQu1 <- as.character(all_data$FireplaceQu)
all_data$FireplaceQu1[which(is.na(all_data$FireplaceQu))] <- "None"
all_data$FireplaceQu <- as.factor(all_data$FireplaceQu1)
all_data <- subset(all_data,select = -FireplaceQu1)
sum(is.na(all_data$FireplaceQu))
[1] 0
unique(all_data$FireplaceQu)
[1] None TA   Gd   Fa   Ex   Po  
Levels: Ex Fa Gd None Po TA
#LotFrontage
all_data$LotFrontage[which(is.na(all_data$LotFrontage))] <- median(all_data$LotFrontage,na.rm = T)
sum(is.na(all_data$LotFrontage))
[1] 0
#GarageType
all_data$GarageType1 <- as.character(all_data$GarageType)
all_data$GarageType1[which(is.na(all_data$GarageType))] <- "None"
all_data$GarageType <- as.factor(all_data$GarageType1)
all_data <- subset(all_data,select = -GarageType1)
sum(is.na(all_data$GarageType))
[1] 0
#GarageFinish
all_data$GarageFinish1 <- as.character(all_data$GarageFinish)
all_data$GarageFinish1[which(is.na(all_data$GarageFinish))] <- "None"
all_data$GarageFinish <- as.factor(all_data$GarageFinish1)
all_data <- subset(all_data,select = -GarageFinish1)
sum(is.na(all_data$GarageFinish))
[1] 0
#GarageQual
all_data$GarageQual1 <- as.character(all_data$GarageQual)
all_data$GarageQual1[which(is.na(all_data$GarageQual))] <- "None"
all_data$GarageQual <- as.factor(all_data$GarageQual1)
all_data <- subset(all_data,select = -GarageQual1)

#GarageCond
all_data$GarageCond1 <- as.character(all_data$GarageCond)
all_data$GarageCond1[which(is.na(all_data$GarageCond))] <- "None"
all_data$GarageCond <- as.factor(all_data$GarageCond1)
all_data <- subset(all_data,select = -GarageCond1)

unique(all_data$GarageCond)
[1] TA   Fa   None Gd   Po   Ex  
Levels: Ex Fa Gd None Po TA
#GarageYrBlt
all_data$GarageYrBlt[which(is.na(all_data$GarageYrBlt))] <- 0
sum(is.na(all_data$GarageYrBlt))
[1] 0
sum(is.na(all_data$GarageCond))
[1] 0
# head(all_data, 2)
unique(all_data$GarageYrBlt)
  [1] 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 1965 2005 1962 2006 1960 1991 1970 1967 1958 1930 2002 1968 2007 2008 1957 1920 1966 1959 1995 1954
 [31] 1953    0 1983 1977 1997 1985 1963 1964 1999 1935 1990 1945 1987 1989 1915 1956 1948 1974 2009 1950 1961 1921 1900 1979 1951 1969 1936 1975 1971 1923
 [61] 1984 1926 1955 1981 1986 1988 1916 1932 1972 1918 1980 1924 1996 1940 1949 1994 1910 1978 1982 1992 1925 1941 2010 1927 1947 1937 1942 1938 1952 1928
 [91] 1922 1934 1906 1914 1946 1908 1929 1933 1917 1896 1895 2207 1943 1919
#GarageArea
all_data$GarageArea[which(is.na(all_data$GarageArea))] <- 0
#GarageCars
all_data$GarageCars[which(is.na(all_data$GarageCars))] <- 0
#BsmtFinSF1
all_data$BsmtFinSF1[which(is.na(all_data$BsmtFinSF1))] <- 0
#BsmtFinSF2
all_data$BsmtFinSF2[which(is.na(all_data$BsmtFinSF2))] <- 0
#BsmtUnfSF
all_data$BsmtUnfSF[which(is.na(all_data$BsmtUnfSF))] <- 0
#TotalBsmtSF
all_data$TotalBsmtSF[which(is.na(all_data$TotalBsmtSF))] <- 0
#BsmtFullBath
all_data$BsmtFullBath[which(is.na(all_data$BsmtFullBath))] <- 0
#BsmtHalfBath
all_data$BsmtHalfBath[which(is.na(all_data$BsmtHalfBath))] <- 0
#BsmtQual
all_data$BsmtQual1 <- as.character(all_data$BsmtQual)
all_data$BsmtQual1[which(is.na(all_data$BsmtQual))] <- "None"
all_data$BsmtQual <- as.factor(all_data$BsmtQual1)
all_data <- subset(all_data,select = -BsmtQual1)
#BsmtCond
all_data$BsmtCond1 <- as.character(all_data$BsmtCond)
all_data$BsmtCond1[which(is.na(all_data$BsmtCond))] <- "None"
all_data$BsmtCond <- as.factor(all_data$BsmtCond1)
all_data <- subset(all_data,select = -BsmtCond1)

#BsmtExposure
all_data$BsmtExposure1 <- as.character(all_data$BsmtExposure)
all_data$BsmtExposure1[which(is.na(all_data$BsmtExposure))] <- "None"
all_data$BsmtExposure <- as.factor(all_data$BsmtExposure1)
all_data <- subset(all_data, select = -BsmtExposure1)

#BsmtFinType1
all_data$BsmtFinType11 <- as.character(all_data$BsmtFinType1)
all_data$BsmtFinType11[which(is.na(all_data$BsmtFinType1))] <- "None"
all_data$BsmtFinType1 <- as.factor(all_data$BsmtFinType11)
all_data <- subset(all_data,select = -BsmtFinType11)

#BsmtFinType2
all_data$BsmtFinType21 <- as.character(all_data$BsmtFinType2)
all_data$BsmtFinType21[which(is.na(all_data$BsmtFinType2))] <- "None"
all_data$BsmtFinType2 <- as.factor(all_data$BsmtFinType21)
all_data <- subset(all_data,select = -BsmtFinType21)
#MasVnrType

all_data$MasVnrType1 <- as.character(all_data$MasVnrType)
all_data$MasVnrType1[which(is.na(all_data$MasVnrType))] <- "None"
all_data$MasVnrType <- as.factor(all_data$MasVnrType1)
all_data <- subset(all_data,select = -MasVnrType1)

#MasVnrArea
all_data$MasVnrArea[which(is.na(all_data$MasVnrArea))] <- 0

#MSZoning
all_data$MSZoning1 <- as.character(all_data$MSZoning)
all_data$MSZoning1[which(is.na(all_data$MSZoning))] <- getmode(all_data$MSZoning)
all_data$MSZoning <- as.factor(all_data$MSZoning1)
all_data <- subset(all_data,select = -MSZoning1)


#Utilities We can safely drop this feature since most of the observations are “AllPub”, 1 is “NoSewa”, and 2 “NA”
unique(all_data$Utilities)
[1] "AllPub" "NoSeWa" NA      
## [1] "AllPub" "NoSeWa" NA
table(all_data$Utilities)

AllPub NoSeWa 
  2898      1 
all_data$Utilities = NULL

#Functional
all_data$Functional1 <- as.character(all_data$Functional)
all_data$Functional1[which(is.na(all_data$Functional))] <- "Typ"
all_data$Functional <- as.factor(all_data$Functional1)
all_data <- subset(all_data,select = -Functional1)
#table(all_data$Functional)

#Electrical
all_data$Electrical1 <- as.character(all_data$Electrical)
all_data$Electrical1[which(is.na(all_data$Electrical))] <- getmode(all_data$Electrical)
all_data$Electrical <- as.factor(all_data$Electrical1)
all_data <- subset(all_data,select = -Electrical1)

#KitchenQual
all_data$KitchenQual1 <- as.character(all_data$KitchenQual)
all_data$KitchenQual1[which(is.na(all_data$KitchenQual))] <- getmode(all_data$KitchenQual)
all_data$KitchenQual <- as.factor(all_data$KitchenQual1)
all_data <- subset(all_data,select = -KitchenQual1)

#Exterior1st
all_data$Exterior1st1 <- as.character(all_data$Exterior1st)
all_data$Exterior1st1[which(is.na(all_data$Exterior1st))] <- getmode(all_data$Exterior1st)
all_data$Exterior1st <- as.factor(all_data$Exterior1st1)
all_data <- subset(all_data,select = -Exterior1st1)

#Exterior2nd
all_data$Exterior2nd1 <- as.character(all_data$Exterior2nd)
all_data$Exterior2nd1[which(is.na(all_data$Exterior2nd))] <- getmode(all_data$Exterior2nd)
all_data$Exterior2nd <- as.factor(all_data$Exterior2nd1)
all_data <- subset(all_data,select = -Exterior2nd1)

#SaleType
all_data$SaleType1 <- as.character(all_data$SaleType)
all_data$SaleType1[which(is.na(all_data$SaleType))] <- getmode(all_data$SaleType)
all_data$SaleType <- as.factor(all_data$SaleType1)
all_data <- subset(all_data,select = -SaleType1)
head(all_data,2)

Converting to Factors

all_data$Street = factor(all_data$Street)
all_data$LotShape = factor(all_data$LotShape)
all_data$LandContour = factor(all_data$LandContour)
all_data$LotConfig = factor(all_data$LotConfig)
all_data$LandSlope = factor(all_data$LandSlope)
all_data$Neighborhood = factor(all_data$Neighborhood)
all_data$Condition1 = factor(all_data$Condition1)
all_data$Condition2 = factor(all_data$Condition2)
all_data$BldgType = factor(all_data$BldgType)
all_data$HouseStyle = factor(all_data$HouseStyle)
all_data$OverallQual = factor(all_data$OverallQual)
all_data$OverallCond = factor(all_data$OverallCond)
all_data$YearBuilt = as.character((all_data$YearBuilt))
all_data$GarageYrBlt = as.character((all_data$GarageYrBlt))
all_data$YearRemodAdd = as.character(all_data$YearRemodAdd)
all_data$RoofStyle = factor(all_data$RoofStyle)
all_data$RoofMatl = factor(all_data$RoofMatl)
all_data$ExterQual = factor(all_data$ExterQual)
all_data$ExterCond = factor(all_data$ExterCond)
all_data$MSSubClass = factor(all_data$MSSubClass)
all_data$Foundation = factor(all_data$Foundation)
all_data$Heating = factor(all_data$Heating)
all_data$HeatingQC = factor(all_data$HeatingQC)
all_data$CentralAir = factor(all_data$CentralAir)
all_data$PavedDrive = factor(all_data$PavedDrive)
all_data$MoSold = factor(all_data$MoSold)
all_data$YrSold = factor(all_data$YrSold)
all_data$SaleCondition = factor(all_data$SaleCondition)
all_data$BsmtFullBath = factor(all_data$BsmtFullBath)
all_data$BsmtHalfBath = factor(all_data$BsmtHalfBath)
all_data$FullBath = factor(all_data$FullBath)
all_data$HalfBath = factor(all_data$HalfBath)
head(all_data, 2)
#Determining skew of each numeric variable

Column_classes <- sapply(names(all_data),function(x){class(all_data[[x]])})
numeric_columns <-names(Column_classes[Column_classes != "factor" & Column_classes != "character"])

skew <- sapply(numeric_columns,function(x){skewness(all_data[[x]],na.rm = T)})
# Let us determine a threshold skewness and transform all variables above the treshold.
skew <- skew[skew > 0.75]
# transform excessively skewed features with log(x + 1)
for(x in names(skew))
{
  all_data[[x]] <- log(all_data[[x]] + 1)
}
ggplot(all_data, mapping = aes(x = SalePrice)) + geom_histogram(color = "black", fill = "blue")

Feature Engineering

Here it can be suggested that total basement surface area, 1st floor surface area and 2nd floor surface area can be equated to total surface area. Based on this TotalSF is created.

#linear relationship b/n ground and upper floors
all_data$TotalSF = all_data$TotalBsmtSF + all_data$X1stFlrSF + all_data$X2ndFlrSF

#all_data <-subset(all_data, select = -c(log_SalePrice))
which(colSums(is.na(all_data))>0)
SalePrice 
       80 
dim(all_data)
[1] 2901   82

Data Partition

train_1 <- all_data[all_data$isTrain==1,]
test_1 <- all_data[all_data$isTrain==0,]
dim(train_1)
[1] 1442   82
dim(test_1)
[1] 1459   82
dim(all_data)
[1] 2901   82
train_1 = subset(train_1, select = -c(Id, isTrain))
test_1 = subset(test_1, select = -c(Id, isTrain, SalePrice))
set.seed(100)
partition <- createDataPartition(train_1$SalePrice, p = 0.80, list = FALSE)
train.m <- train_1[partition, ]
test.m <- train_1[-partition, ]

head(train.m, 3)

Modeling

RandomForest Model

###FULL MODEL ###
model1 = randomForest(SalePrice ~ ., data = train.m)
summary(model1)
                Length Class  Mode     
call               3   -none- call     
type               1   -none- character
predicted       1155   -none- numeric  
mse              500   -none- numeric  
rsq              500   -none- numeric  
oob.times       1155   -none- numeric  
importance        79   -none- numeric  
importanceSD       0   -none- NULL     
localImportance    0   -none- NULL     
proximity          0   -none- NULL     
ntree              1   -none- numeric  
mtry               1   -none- numeric  
forest            11   -none- list     
coefs              0   -none- NULL     
y               1155   -none- numeric  
test               0   -none- NULL     
inbag              0   -none- NULL     
terms              3   terms  call     

RF Model Plot

plot(model1)

Variable Importance

imp = importance(model1)
imp
              IncNodePurity
MSSubClass      3.149091652
MSZoning        1.110581653
LotFrontage     0.682229113
LotArea         2.288074115
Street          0.019590196
Alley           0.149008333
LotShape        0.175011443
LandContour     0.157752933
LotConfig       0.126068901
LandSlope       0.159013064
Neighborhood   22.184734180
Condition1      0.132893897
Condition2      0.009086202
BldgType        0.150736037
HouseStyle      0.359474363
OverallQual    40.360696664
OverallCond     0.945470293
YearBuilt       6.505207497
YearRemodAdd    1.142558284
RoofStyle       0.173775477
RoofMatl        0.025706225
Exterior1st     0.896598040
Exterior2nd     0.986840753
MasVnrType      0.121627340
MasVnrArea      0.381078051
ExterQual      11.813144281
ExterCond       0.264751537
Foundation      0.144461509
BsmtQual        3.123396609
BsmtCond        0.163381763
BsmtExposure    0.353701095
BsmtFinType1    0.956226487
BsmtFinSF1      1.773458995
BsmtFinType2    0.207345659
BsmtFinSF2      0.081145774
BsmtUnfSF       0.592320119
TotalBsmtSF     4.056414420
Heating         0.030625280
HeatingQC       0.241954812
CentralAir      0.394723268
Electrical      0.098157843
X1stFlrSF       4.811371727
X2ndFlrSF       1.594500473
LowQualFinSF    0.026265876
GrLivArea      20.177404091
BsmtFullBath    0.185131635
BsmtHalfBath    0.026164667
FullBath        3.595580973
HalfBath        0.187422288
BedroomAbvGr    0.286169709
KitchenAbvGr    0.068327509
KitchenQual     3.086390979
TotRmsAbvGrd    0.850242086
Functional      0.153525048
Fireplaces      0.792973180
FireplaceQu     1.813495808
GarageType      2.373585373
GarageYrBlt     2.228938668
GarageFinish    1.337870492
GarageCars      5.218971747
GarageArea      5.461616387
GarageQual      0.335461870
GarageCond      0.812706770
PavedDrive      0.186963858
WoodDeckSF      0.335663339
OpenPorchSF     0.688378240
EnclosedPorch   0.209572910
X3SsnPorch      0.019876516
ScreenPorch     0.091972942
PoolArea        0.003518593
PoolQC          0.005986523
Fence           0.129436614
MiscFeature     0.021427048
MiscVal         0.030886911
MoSold          1.909772661
YrSold          0.486792859
SaleType        0.113243862
SaleCondition   0.362717834
TotalSF         5.234178681

Plot of Variable Importance

varImpPlot(model1)

# RMSE of full model using log of SalePrice
RMSE(test.m$SalePrice, predict(model1, newdata = test.m))
[1] 0.1617629

Prediction with the full model

pred = predict(model1, newdata = test.m)
df = data.frame(Pred = exp(pred), Test = exp(test.m$SalePrice))

Prediction VS Actual

head(df, 5)

RMSE of the full model using SalePrice

sqrt( mean( (exp(pred)-exp(test.m$SalePrice)) ^2) )
[1] 32233.15

Reduced Model

train.rf = train.m
model.select.rf = randomForest(SalePrice ~ TotalSF + MSZoning + LotArea + LotShape + Neighborhood + Condition1 +Condition2+
                                 BldgType +  OverallQual +OverallCond + YearBuilt + RoofMatl + Exterior1st + ExterCond+
                                 Foundation +BsmtQual+ KitchenQual+FullBath+BsmtExposure+Fireplaces+ SaleCondition+
                                 CentralAir+PavedDrive,data = train.rf)  
summary(model.select.rf)
                Length Class  Mode     
call               3   -none- call     
type               1   -none- character
predicted       1155   -none- numeric  
mse              500   -none- numeric  
rsq              500   -none- numeric  
oob.times       1155   -none- numeric  
importance        23   -none- numeric  
importanceSD       0   -none- NULL     
localImportance    0   -none- NULL     
proximity          0   -none- NULL     
ntree              1   -none- numeric  
mtry               1   -none- numeric  
forest            11   -none- list     
coefs              0   -none- NULL     
y               1155   -none- numeric  
test               0   -none- NULL     
inbag              0   -none- NULL     
terms              3   terms  call     
###model plot
plot(model.select.rf)

variable of importance Reduced Model

imp = importance(model.select.rf)
imp
              IncNodePurity
TotalSF         14.98225966
MSZoning         2.11368832
LotArea         10.07284516
LotShape         0.91405716
Neighborhood    30.33837965
Condition1       0.75819870
Condition2       0.07793706
BldgType         1.09039790
OverallQual     44.97055960
OverallCond      2.69011137
YearBuilt       16.10910241
RoofMatl         0.21277160
Exterior1st      3.69241605
ExterCond        0.71684523
Foundation       2.00885548
BsmtQual         7.51937808
KitchenQual     11.26561590
FullBath        10.38556873
BsmtExposure     1.67606403
Fireplaces       5.93462656
SaleCondition    1.37223695
CentralAir       1.06931075
PavedDrive       0.92066421

Plot of Variable of Importance

varImpPlot(model.select.rf)

RMSE of the Reduced model using log of SalePrice

RMSE(test.m$SalePrice, predict(model.select.rf, newdata = test.m))
[1] 0.1831383

Prediction vs Actual

#####Prediction with the reduced model###
pred = predict(model.select.rf, newdata = test.m)

########
df = data.frame(Pred = exp(pred), Test = exp(test.m$SalePrice))
head(df, 5)

RMSE of the Reduced model using SalePrice

pred = predict(model.select.rf, newdata = test.m)


sqrt( mean( (exp(pred)-exp(test.m$SalePrice)) ^2) )
[1] 39366.59

Given the RMSE of the full model and the Reduce model,it can be suggested that the full model returns a lower RMSE when compared to the reduced model, hence will cost less to deploy.

---
title: "Kaggle Project: Predicting Ames House Prices"
author: "Charles Wiredu"
date: "2022-10-18"
output:
  html_notebook:
    toc: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Background

In order to put newly acquired data science skills to test, l decided to explore the well-known dataset Ames Housing dataset. The Ames dataset used in this instance documents 2006-2010 data records describing the sale of individual residential property in Ames, Iowa. The Ames Housing Dataset was complied by Dean De Cock for use in data science education and publicly available as a competition hosted  on Kaggle.com,in which the goal is to build a model to predict sale prices of houses sold in Ames, Iowa . With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this dataset challenges me to predict the final price of each home.The data set contains 2930 observations and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous) involved in assessing home values.

Kaggle provides 2 datafiles, one is a training set containing all 81 variables, and the other is a test set, which is missing the SalePrice column.

## Required Libraries
Below are required libraries that are employed at various stages throughout the modelling process and also  to possibly allow for the running of this notebook.
```{r}
library('ggplot2')
library('ggthemes')
library('scales')
library('dplyr')
library('mice')
library('randomForest')
library('data.table')
library('gridExtra')
library('corrplot')
library('GGally')
library('e1071')
library('reshape2')
library('lares')
library('gbm')
library('MASS')
library('caret')
library('skimr')
```

# Loading of Dataset
The two data files saved on my computer are loaded into R for the purpose of further analysis.
```{r}
train = read.csv("train_reg_features-1.csv")
#Train data dimension
dim(train)
test = read.csv("test_reg_features-1.csv")
#test data dimension
dim(test)
```

# Exploratoratory Data Analysis
For the purpose of exploratory analysis , the test and train data set are combined for the pupose of cleaning and analysis. The however allows me to preview the entire(combined) dataset.The combined dataset will be later separated to prevent data leakage and repetition during modelling. The combined datasset contains 2919 observations coupled with 81 variables. The training and testing data reports of 1460 and 1459 observations respectively.
```{r}
test$SalePrice <-NA 

#Bind the train and test datasets for EDA/cleaning
combine_data <-rbind(train,test)
```

```{r}
str(combine_data)
```

From the above it can be observed that all variables are either class character or integer, as such most of them needs to be recoded as numeric or factor. The project documentation suggest that there are 23 ordinal, 23 nominal, 14 discrete and 20 continuous variables.
```{r}
table(sapply(combine_data, class))
```
## Target Variable Distribution
Below is a visualization of the target variable, which is demonstrated using a histogram and a density plot.The initial plot of the target variable suggest that it is not normally distributed and hence needs to be corrected.The skewness and kurtosis as reported are 1.879009 and 6.4496789 respectively.
```{r}
## Summary of Target Variable
summary(combine_data$SalePrice)

##checking skewness and kurtosis
skewness(train$SalePrice)
kurtosis(train$SalePrice)
```
```{r}
### Visual view of target variable
ggplot(combine_data, mapping = aes(x = SalePrice)) + geom_histogram(color = "black", fill = "lightblue")
```

## Normalization of target variable
Target variable(SalePrice) after performing log transformation, looks much more normal.
```{r}
combine_data$log_SalePrice <-log(combine_data$SalePrice)

ggplot(data = combine_data, aes(x = log_SalePrice)) +
  geom_histogram(col = "#E1AF00", fill = "#3B9AB2", bins = 50)  + 
  scale_x_continuous(breaks = seq(0, 800000, by = 100000), labels = comma) +
  labs(title = "Housing Data: Log(Sale Price)", x = "Log(Sale Price)")
```
## Checking categorical columns
Below is a check on categorical columns in the combined dataset.
```{r}

names(combine_data)[sapply(combine_data, typeof) == "character"]
sum(sapply(combine_data[,], typeof) == "character")

```
## Checking numerical columns
Below is a view of all numerical columns found in combined dataset.
```{r}
names(combine_data)[unlist(lapply(combine_data, is.numeric))]
sum(sapply(combine_data[,], typeof) != "character")
```

## Distribution Plot of Numerical Variables
```{r}
#Skewed
#continuous
p1 <-ggplot(data = combine_data, aes(x = LotFrontage)) +
  geom_histogram(fill = "#3B9AB2", bins = 50)  
 

#Definitely needs to be transformed (prob on a log scale)
#continuous
p2 <-ggplot(data = combine_data, aes(x = LotArea)) +
  geom_histogram(fill = "#3B9AB2", bins = 50)  
  
#ordinal
p3<-ggplot(data = combine_data, aes(x = OverallQual)) +
  geom_bar(fill = "#3B9AB2")  


#ordinal
p4<-ggplot(data = combine_data, aes(x = OverallCond)) +
  geom_bar(fill = "#3B9AB2")  


#YearBuilt- negatively skewed, more newer houses (DISCRETE BUT STILL...)
p5<-ggplot(data =combine_data, aes(x = YearBuilt)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#interesting distribution- something weird going on with the 0s here (meaning not remodeled)
#DISCRETE
p6<-ggplot(data = combine_data, aes(x = YearRemodAdd)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#Obviously missing data issue
#continuous
p7<-ggplot(data = combine_data, aes(x = MasVnrArea)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 
 

#missing data (0 = no finished basement)
#continuous
p8<-ggplot(data =combine_data, aes(x = BsmtFinSF1)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 
  

#same deal w/missing, would need to leave out 0s to see the real dist. here. 
#or do a log transform or whatever. 
#continuous
p9<-ggplot(data = combine_data, aes(x = BsmtFinSF2)) +
  geom_histogram(fill = "#3B9AB2", bins = 30) 


#continuous
p10<-ggplot(data = combine_data, aes(x = BsmtUnfSF)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 
 
#continuous
p11<-ggplot(data = combine_data, aes(x = TotalBsmtSF)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#continuous
p12<-ggplot(data = combine_data, aes(x = X1stFlrSF)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#LOTS of missing values here (aka lots of single-story homes)
#continuous
p13<-ggplot(data = combine_data, aes(x = X2ndFlrSF)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#pointless- on a log scale, or too much missing data
#maybe could set these to not = 0?
#continuous
p14 <-ggplot(data = combine_data, aes(x = LowQualFinSF)) +
  geom_histogram(fill = "#3B9AB2", bins = 10) 


#skewed
#continuous
p15 <-ggplot(data = combine_data, aes(x = GrLivArea)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#discrete
p16<-ggplot(data =combine_data, aes(x = BsmtFullBath)) +
  geom_bar(fill = "#3B9AB2") 


#discrete
p17<-ggplot(data = combine_data, aes(x = BsmtHalfBath)) +
  geom_bar(fill = "#3B9AB2") 


#discrete
p18 <-ggplot(data = combine_data, aes(x = FullBath )) +
  geom_bar(fill = "#3B9AB2") 


#discrete
p19 <-ggplot(data = combine_data, aes(x =HalfBath )) +
  geom_bar(fill = "#3B9AB2") 
 

#discrete
p20 <-ggplot(data = combine_data, aes(x =BedroomAbvGr )) +
  geom_bar(fill = "#3B9AB2") 


#discrete
p21 <-ggplot(data = combine_data, aes(x =KitchenAbvGr )) +
  geom_bar(fill = "#3B9AB2") 
 

#TOTRMSABVGRD
#(discrete)
p22<-ggplot(data = combine_data, aes(x = TotRmsAbvGrd)) +
  geom_bar(fill = "#3B9AB2") 



#FIREPLACES
#(discrete)
p23<-ggplot(data = combine_data, aes(x = Fireplaces)) +
  geom_bar(fill = "#3B9AB2") 


#skewed
#DISCRETE BUT STILL...
p24 <-ggplot(data = combine_data, aes(x = GarageYrBlt)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) + 
  scale_x_continuous(limits = c(1900, 2020))


#(discrete) (this is absurd, clearly should be ordinal??)
p25<-ggplot(data = combine_data, aes(x = GarageCars)) +
  geom_bar(fill = "#3B9AB2") 
 

#0 = no garage
#CONTINUOUS
p26 <-ggplot(data = combine_data, aes(x = GarageArea)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 
 

#continuous
p27 <-ggplot(data = combine_data, aes(x = WoodDeckSF)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 
 

#continuous
p28 <-ggplot(data = combine_data, aes(x = OpenPorchSF)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 

 
#continuous
p29 <-ggplot(data = combine_data, aes(x = EnclosedPorch)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 
 

#continuous
p30 <-ggplot(data = combine_data, aes(x = X3SsnPorch)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#continuous
p31 <-ggplot(data = combine_data, aes(x = ScreenPorch)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#Continuous
p32 <-ggplot(data = combine_data, aes(x = PoolArea)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 
  

#continuous
p33 <-ggplot(data = combine_data, aes(x = MiscVal)) +
  geom_histogram(fill = "#3B9AB2", bins = 50) 


#MO SOLD?? (DISCRETE) (RECODE)
p34 <-ggplot(data = combine_data, aes(x = MoSold)) +
  geom_bar(fill = "#3B9AB2") 


#YR SOLD (discrete)
p35 <-ggplot(data =combine_data, aes(x = YrSold)) +
  geom_bar(fill = "#3B9AB2") 


```

```{r}

grid.arrange(p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,
             p12,p13,p14,p15,p16,p17,p18,p19,p20,ncol = 5)
grid.arrange(p21,p22,p23,p24,p25,p26,p27,p28,p29,p30,
             p31,p32,p33,p34,p35,ncol = 4)
```
From the above plots, it can be noticed that there many numeric variables that are not normally distributed coupled with other variables such as LotFrontage, MasVnrArea and other variables appearing to be skewed due to the presence of outliers.

## Distribution Plot of Categorical Variables
```{r}

b1<-ggplot(data = combine_data, aes(x = MSZoning)) +
  geom_bar(fill = "#3B9AB2") 
 

b2<-ggplot(data = combine_data, aes(x = Street)) +
  geom_bar(fill = "#3B9AB2") 


b3<-ggplot(data = combine_data, aes(x = Alley)) +
  geom_bar(fill = "#3B9AB2") 


b4<-ggplot(data = combine_data, aes(x = LotShape)) +
  geom_bar(fill = "#3B9AB2")  


b5 <-ggplot(data = combine_data, aes(x = LandContour)) +
  geom_bar(fill = "#3B9AB2") 



b6 <-ggplot(data = combine_data, aes(x = Utilities)) +
  geom_bar(fill = "#3B9AB2") #+ 



b7 <-ggplot(data = combine_data, aes(x = LotConfig)) +
  geom_bar(fill = "#3B9AB2") 



b8 <-ggplot(data = combine_data, aes(x = LandSlope)) +
  geom_bar(fill = "#3B9AB2") 



b9<-ggplot(data = combine_data, aes(x = Neighborhood)) +
  geom_bar(fill = "#3B9AB2") + coord_flip() 



b10 <-ggplot(data = combine_data, aes(x = Condition1)) +
  geom_bar(fill = "#3B9AB2") + coord_flip() 


b11 <-ggplot(data = combine_data, aes(x = Condition2)) +
  geom_bar(fill = "#3B9AB2") + coord_flip() 



b12 <-ggplot(data = combine_data, aes(x = BldgType)) +
  geom_bar(fill = "#3B9AB2") 


b13 <-ggplot(data = combine_data, aes(x = HouseStyle)) +
  geom_bar(fill = "#3B9AB2") 



b14 <-ggplot(data = combine_data, aes(x = RoofStyle)) +
  geom_bar(fill = "#3B9AB2") 



b15<-ggplot(data = combine_data, aes(x = RoofMatl)) +
  geom_bar(fill = "#3B9AB2") + coord_flip() 
 


b16 <-ggplot(data = combine_data, aes(x = Exterior1st)) +
  geom_bar(fill = "#3B9AB2") + coord_flip()
  


b17 <-ggplot(data = combine_data, aes(x = Exterior2nd)) +
  geom_bar(fill = "#3B9AB2") + coord_flip() 
 


b18 <-ggplot(data = combine_data, aes(x = MasVnrType)) +
  geom_bar(fill = "#3B9AB2") 



b19 <-ggplot(data = combine_data, aes(x = ExterQual)) +
  geom_bar(fill = "#3B9AB2") 



b20 <-ggplot(data = combine_data, aes(x = ExterCond)) +
  geom_bar(fill = "#3B9AB2") 



b21 <-ggplot(data = combine_data, aes(x = Foundation)) +
  geom_bar(fill = "#3B9AB2") 


b22 <-ggplot(data = combine_data, aes(x = BsmtQual)) +
  geom_bar(fill = "#3B9AB2") 

b23 <-ggplot(data = combine_data, aes(x = BsmtCond)) +
  geom_bar(fill = "#3B9AB2") 
 


b24 <-ggplot(data = combine_data, aes(x = BsmtExposure)) +
  geom_bar(fill = "#3B9AB2") 


b25 <-ggplot(data = combine_data, aes(x = BsmtFinType1)) +
  geom_bar(fill = "#3B9AB2") 
 


b26 <-ggplot(data = combine_data, aes(x = BsmtFinType2)) +
  geom_bar(fill = "#3B9AB2") 
  

b27 <-ggplot(data = combine_data, aes(x = Heating)) +
  geom_bar(fill = "#3B9AB2") 



b28 <-ggplot(data = combine_data, aes(x = HeatingQC)) +
  geom_bar(fill = "#3B9AB2") 


b29 <-ggplot(data =combine_data, aes(x = CentralAir)) +
  geom_bar(fill = "#3B9AB2") 
  


b30 <-ggplot(data = combine_data, aes(x = Electrical)) +
  geom_bar(fill = "#3B9AB2")  


b31 <-ggplot(data = combine_data, aes(x = KitchenQual)) +
  geom_bar(fill = "#3B9AB2") 



b32<-ggplot(data = combine_data, aes(x = Functional)) +
  geom_bar(fill = "#3B9AB2") 



b33<-ggplot(data = combine_data, aes(x = FireplaceQu)) +
  geom_bar(fill = "#3B9AB2") 


b34<-ggplot(data = combine_data, aes(x = GarageType)) +
  geom_bar(fill = "#3B9AB2") 


b35 <-ggplot(data = combine_data, aes(x = GarageFinish)) +
  geom_bar(fill = "#3B9AB2") 



b36 <-ggplot(data = combine_data, aes(x = GarageQual)) +
  geom_bar(fill = "#3B9AB2") 

```

```{r}
grid.arrange(b1,b2,b3,b4,b5,b6,b7,b8,b9,b10,ncol = 4)
grid.arrange(b11,b12,b13,b14,b15,b16,b17,b18,b19,b20,ncol = 4)
grid.arrange(b21,b22,b23,b24,b25,b26,b27,b28,b29,b30,b31,b32,b33,b34,b35,b36,ncol = 4)
```
From the above some categorical variables display evidence of non-normality and (multi)collineraity whiles others variables may be useful features in predicting the target variable. It can also be noted that are significant amount of missing data values.Many of these NA observations are likely to be structurally missing for a logical reason. For instance GarageArea= NA, suggest that the property has no garage. As such missing data will require imputation.

## Splitting Data into Numerical and Categorical Variables
For the purpose of visualization, the combined dataset is split into numerical and categorical variables.
Also one train dataset with categorical and numeric variable is created 
```{r}
cat_var <- names(combine_data)[which(sapply(combine_data, is.character))]
numeric_var <- names(combine_data)[which(sapply(combine_data, is.numeric))]
```

```{r}
train1_cat<-combine_data[cat_var]
train1_num<-combine_data[numeric_var]
```

##  Bivariate Plots(numeric)
This section looks at each variable and how they relate to the target variable(salePrice)
```{r}
# Overall Quality vs Sale Price
# unique(train$OverallQual)
ggplot(train, mapping = aes(x = factor(OverallQual), y = SalePrice)) + geom_boxplot(col = "#3B9AB2")
```
From the above plot, it can be suggested that people are willing to pay more for better quality.
```{r}
# Living Area vs Sale Price
ggplot(train, mapping = aes(x = GrLivArea, y = SalePrice)) + geom_point() + geom_smooth(method = "lm",col = "#3B9AB2")
```
From the above it can be suggested that people are would pay more for living area. However its not reasonable for people to pay less for large area as two data point in the bottom-right of the plot. These two points however needs to be removed.
```{r}
# Removing outliers manually (The two points in the bottom right)
train = train[train$GrLivArea<=4500,]
ggplot(train, mapping = aes(x = GrLivArea, y = SalePrice)) + geom_point() + geom_smooth(method = "lm",col = "#3B9AB2")
```

```{r}
# Garage Area vs Sale Price
unique(train$GarageCars)
ggplot(train, mapping = aes(x = factor(GarageCars), y = SalePrice)) + geom_boxplot(col = "#3B9AB2")

```
The relationship between garagecars and saleprice suggest that 4-car garages result in less sale Price which might be as a result of outliers. These outliers must be removed.
```{r}
train = train[train$GarageCars < 4, ]
ggplot(train, mapping = aes(x = factor(GarageCars), y = SalePrice)) + geom_boxplot(col = "#3B9AB2")
```

```{r}
# Garage Area vs Sale Price
ggplot(train, mapping = aes(x = GarageArea , y = SalePrice)) + geom_point() + geom_smooth(method = "lm",col = "#3B9AB2")
```
```{r}
##Again, the two data points at the bottom does not make sense
train = filter(train, GarageArea < 1240)
ggplot(train, mapping = aes(x = GarageArea , y = SalePrice)) + geom_point() + geom_smooth(method = "lm",col = "#3B9AB2")
```
```{r}
# Basement Area vs Sale Price
ggplot(train, mapping = aes(x = TotalBsmtSF , y = SalePrice)) + geom_point() + geom_smooth(method = "lm",col = "#3B9AB2")
```

```{r}
# First Floor Area vs Sale Price
ggplot(train, mapping = aes(x = X1stFlrSF, y = SalePrice)) + geom_point() + geom_smooth(method = "lm",col = "#3B9AB2")
```
```{r}
# # ExterQual vs Sale Price
ggplot(train, mapping = aes(x = factor(ExterQual), y = SalePrice)) + geom_boxplot(col = "#3B9AB2")
```
The above plot suggest that, houses with excellent external quality are expensive whereas houses with fair quality material are less expensive.

```{r}
# ExterQual vs Sale Price
ggplot(train, mapping = aes(x = factor(FullBath), y = SalePrice)) + geom_boxplot(col = "#3B9AB2")

```
From the above plot it is not reasonable that houses with zero full bath are expensive than houses with one full bath.This will be an outlier and must be removed from the data set.
```{r}
train = train[train$FullBath > 0 , ]
ggplot(train, mapping = aes(x = factor(FullBath), y = SalePrice)) + geom_boxplot(col = "#3B9AB2")
```
```{r}
#BsmtQual and SalePrice
ggplot(train, mapping = aes(x = factor(BsmtQual), y = SalePrice)) + geom_boxplot(col = "#3B9AB2")

```
```{r}
# Total Rooms vs Sale Price
ggplot(train, mapping = aes(x = factor(TotRmsAbvGrd), y = SalePrice)) + geom_boxplot(col = "#3B9AB2")+ scale_y_discrete()
```

## Correlation Matrix
```{r}
correlations <- cor(na.omit(train1_num[,-1]))
head(correlations, 2)
```
```{r}
melted_correlations = melt(correlations)
head(melted_correlations, 2)


ggplot(data = melted_correlations, aes(x=Var1, y=Var2, fill=value)) +
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1,1), space = "Lab",
                       name="Pearson\nCorrelation") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.4,
                                   size = 12, hjust = 0.2))+ theme(aspect.ratio = 1) +
  coord_fixed()
```
The above correlation plot is too clustered, hence lets consider the top 10 correlated variables.

### Top 10 correlated variables
```{r}
corr_cross(train, # name of dataset
           max_pvalue = 0.05, # display only significant correlations (at 5% level)
           top = 10 # display top 10 couples of variables (by correlation coefficient)
)
```


### Top 10 correlated variables with Sale Price
```{r}
top_10 = corr_var(train, # name of dataset
                  SalePrice, # name of variable to focus on
                  top = 10 # display top 5 correlations
)
top_10

```
From the above, Blue indicate positive correlation and red indicate negative correlations. ExterQual_TA from the plot negatively correlates with the target variable.



# Data Cleaning
Before performing modeling, all variables with missing data are identified and corrected.
```{r}
test$SalePrice <- NA
train$isTrain <- 1
test$isTrain <- 0
combine_data <-rbind(train,test)
head(combine_data,2)



Missing_indices <- sapply(combine_data,function(x) sum(is.na(x)))/nrow(combine_data)

Missing_Summary <- data.frame(index = names(combine_data),Missing_Values=Missing_indices)

Missing_Summary[order(Missing_Summary$Missing_Values > 0, decreasing = TRUE),]
```




```{r}
### percentage of missing cases
sum(is.na(combine_data))/ (dim(combine_data)[1] *dim(combine_data)[2]) *100 

all_data_na = Missing_Summary[order(Missing_Summary$Missing_Values, decreasing = TRUE), ]
all_data_na[1:5, ]

all_data <- combine_data

```
## Inputing Missing Values
From the above variables with missing values are fixed since most of theme are actually not missing.
```{r}
#Getmode

getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
#PoolQC
#Changing NA in PoolQC to None
all_data$PoolQC1 <- as.character(all_data$PoolQC)
all_data$PoolQC1[which(is.na(all_data$PoolQC))] <- "None"
all_data$PoolQC <- as.factor(all_data$PoolQC1)
all_data <- subset(all_data,select = -PoolQC1)
sum(is.na(all_data$PoolQC))

#MiscFeature
all_data$MiscFeature1 <- as.character(all_data$MiscFeature)
all_data$MiscFeature1[which(is.na(all_data$MiscFeature))] <- "None"
all_data$MiscFeature <- as.factor(all_data$MiscFeature1)
all_data <- subset(all_data,select = -MiscFeature1)
sum(is.na(all_data$MiscFeature))

#Alley
all_data$Alley1 <- as.character(all_data$Alley)
all_data$Alley1[which(is.na(all_data$Alley))] <- "None"
all_data$Alley <- as.factor(all_data$Alley1)
all_data <- subset(all_data,select = -Alley1)
sum(is.na(all_data$Alley))

#Fence
all_data$Fence1 <- as.character(all_data$Fence)
all_data$Fence1[which(is.na(all_data$Fence))] <- "None"
all_data$Fence <- as.factor(all_data$Fence1)
all_data <- subset(all_data,select = -Fence1)
sum(is.na(all_data$Fence))

#FireplaceQu
all_data$FireplaceQu1 <- as.character(all_data$FireplaceQu)
all_data$FireplaceQu1[which(is.na(all_data$FireplaceQu))] <- "None"
all_data$FireplaceQu <- as.factor(all_data$FireplaceQu1)
all_data <- subset(all_data,select = -FireplaceQu1)
sum(is.na(all_data$FireplaceQu))

unique(all_data$FireplaceQu)

#LotFrontage
all_data$LotFrontage[which(is.na(all_data$LotFrontage))] <- median(all_data$LotFrontage,na.rm = T)
sum(is.na(all_data$LotFrontage))

#GarageType
all_data$GarageType1 <- as.character(all_data$GarageType)
all_data$GarageType1[which(is.na(all_data$GarageType))] <- "None"
all_data$GarageType <- as.factor(all_data$GarageType1)
all_data <- subset(all_data,select = -GarageType1)
sum(is.na(all_data$GarageType))

#GarageFinish
all_data$GarageFinish1 <- as.character(all_data$GarageFinish)
all_data$GarageFinish1[which(is.na(all_data$GarageFinish))] <- "None"
all_data$GarageFinish <- as.factor(all_data$GarageFinish1)
all_data <- subset(all_data,select = -GarageFinish1)
sum(is.na(all_data$GarageFinish))

#GarageQual
all_data$GarageQual1 <- as.character(all_data$GarageQual)
all_data$GarageQual1[which(is.na(all_data$GarageQual))] <- "None"
all_data$GarageQual <- as.factor(all_data$GarageQual1)
all_data <- subset(all_data,select = -GarageQual1)

#GarageCond
all_data$GarageCond1 <- as.character(all_data$GarageCond)
all_data$GarageCond1[which(is.na(all_data$GarageCond))] <- "None"
all_data$GarageCond <- as.factor(all_data$GarageCond1)
all_data <- subset(all_data,select = -GarageCond1)

unique(all_data$GarageCond)

#GarageYrBlt
all_data$GarageYrBlt[which(is.na(all_data$GarageYrBlt))] <- 0
sum(is.na(all_data$GarageYrBlt))
sum(is.na(all_data$GarageCond))

# head(all_data, 2)
unique(all_data$GarageYrBlt)

#GarageArea
all_data$GarageArea[which(is.na(all_data$GarageArea))] <- 0
#GarageCars
all_data$GarageCars[which(is.na(all_data$GarageCars))] <- 0
#BsmtFinSF1
all_data$BsmtFinSF1[which(is.na(all_data$BsmtFinSF1))] <- 0
#BsmtFinSF2
all_data$BsmtFinSF2[which(is.na(all_data$BsmtFinSF2))] <- 0
#BsmtUnfSF
all_data$BsmtUnfSF[which(is.na(all_data$BsmtUnfSF))] <- 0
#TotalBsmtSF
all_data$TotalBsmtSF[which(is.na(all_data$TotalBsmtSF))] <- 0
#BsmtFullBath
all_data$BsmtFullBath[which(is.na(all_data$BsmtFullBath))] <- 0
#BsmtHalfBath
all_data$BsmtHalfBath[which(is.na(all_data$BsmtHalfBath))] <- 0
#BsmtQual
all_data$BsmtQual1 <- as.character(all_data$BsmtQual)
all_data$BsmtQual1[which(is.na(all_data$BsmtQual))] <- "None"
all_data$BsmtQual <- as.factor(all_data$BsmtQual1)
all_data <- subset(all_data,select = -BsmtQual1)
#BsmtCond
all_data$BsmtCond1 <- as.character(all_data$BsmtCond)
all_data$BsmtCond1[which(is.na(all_data$BsmtCond))] <- "None"
all_data$BsmtCond <- as.factor(all_data$BsmtCond1)
all_data <- subset(all_data,select = -BsmtCond1)

#BsmtExposure
all_data$BsmtExposure1 <- as.character(all_data$BsmtExposure)
all_data$BsmtExposure1[which(is.na(all_data$BsmtExposure))] <- "None"
all_data$BsmtExposure <- as.factor(all_data$BsmtExposure1)
all_data <- subset(all_data, select = -BsmtExposure1)

#BsmtFinType1
all_data$BsmtFinType11 <- as.character(all_data$BsmtFinType1)
all_data$BsmtFinType11[which(is.na(all_data$BsmtFinType1))] <- "None"
all_data$BsmtFinType1 <- as.factor(all_data$BsmtFinType11)
all_data <- subset(all_data,select = -BsmtFinType11)

#BsmtFinType2
all_data$BsmtFinType21 <- as.character(all_data$BsmtFinType2)
all_data$BsmtFinType21[which(is.na(all_data$BsmtFinType2))] <- "None"
all_data$BsmtFinType2 <- as.factor(all_data$BsmtFinType21)
all_data <- subset(all_data,select = -BsmtFinType21)
#MasVnrType

all_data$MasVnrType1 <- as.character(all_data$MasVnrType)
all_data$MasVnrType1[which(is.na(all_data$MasVnrType))] <- "None"
all_data$MasVnrType <- as.factor(all_data$MasVnrType1)
all_data <- subset(all_data,select = -MasVnrType1)

#MasVnrArea
all_data$MasVnrArea[which(is.na(all_data$MasVnrArea))] <- 0

#MSZoning
all_data$MSZoning1 <- as.character(all_data$MSZoning)
all_data$MSZoning1[which(is.na(all_data$MSZoning))] <- getmode(all_data$MSZoning)
all_data$MSZoning <- as.factor(all_data$MSZoning1)
all_data <- subset(all_data,select = -MSZoning1)


#Utilities We can safely drop this feature since most of the observations are “AllPub”, 1 is “NoSewa”, and 2 “NA”
unique(all_data$Utilities)

## [1] "AllPub" "NoSeWa" NA
table(all_data$Utilities)
all_data$Utilities = NULL

#Functional
all_data$Functional1 <- as.character(all_data$Functional)
all_data$Functional1[which(is.na(all_data$Functional))] <- "Typ"
all_data$Functional <- as.factor(all_data$Functional1)
all_data <- subset(all_data,select = -Functional1)
#table(all_data$Functional)

#Electrical
all_data$Electrical1 <- as.character(all_data$Electrical)
all_data$Electrical1[which(is.na(all_data$Electrical))] <- getmode(all_data$Electrical)
all_data$Electrical <- as.factor(all_data$Electrical1)
all_data <- subset(all_data,select = -Electrical1)

#KitchenQual
all_data$KitchenQual1 <- as.character(all_data$KitchenQual)
all_data$KitchenQual1[which(is.na(all_data$KitchenQual))] <- getmode(all_data$KitchenQual)
all_data$KitchenQual <- as.factor(all_data$KitchenQual1)
all_data <- subset(all_data,select = -KitchenQual1)

#Exterior1st
all_data$Exterior1st1 <- as.character(all_data$Exterior1st)
all_data$Exterior1st1[which(is.na(all_data$Exterior1st))] <- getmode(all_data$Exterior1st)
all_data$Exterior1st <- as.factor(all_data$Exterior1st1)
all_data <- subset(all_data,select = -Exterior1st1)

#Exterior2nd
all_data$Exterior2nd1 <- as.character(all_data$Exterior2nd)
all_data$Exterior2nd1[which(is.na(all_data$Exterior2nd))] <- getmode(all_data$Exterior2nd)
all_data$Exterior2nd <- as.factor(all_data$Exterior2nd1)
all_data <- subset(all_data,select = -Exterior2nd1)

#SaleType
all_data$SaleType1 <- as.character(all_data$SaleType)
all_data$SaleType1[which(is.na(all_data$SaleType))] <- getmode(all_data$SaleType)
all_data$SaleType <- as.factor(all_data$SaleType1)
all_data <- subset(all_data,select = -SaleType1)
head(all_data,2)
```
## Converting to Factors
```{r}
all_data$Street = factor(all_data$Street)
all_data$LotShape = factor(all_data$LotShape)
all_data$LandContour = factor(all_data$LandContour)
all_data$LotConfig = factor(all_data$LotConfig)
all_data$LandSlope = factor(all_data$LandSlope)
all_data$Neighborhood = factor(all_data$Neighborhood)
all_data$Condition1 = factor(all_data$Condition1)
all_data$Condition2 = factor(all_data$Condition2)
all_data$BldgType = factor(all_data$BldgType)
all_data$HouseStyle = factor(all_data$HouseStyle)
all_data$OverallQual = factor(all_data$OverallQual)
all_data$OverallCond = factor(all_data$OverallCond)
all_data$YearBuilt = as.character((all_data$YearBuilt))
all_data$GarageYrBlt = as.character((all_data$GarageYrBlt))
all_data$YearRemodAdd = as.character(all_data$YearRemodAdd)
all_data$RoofStyle = factor(all_data$RoofStyle)
all_data$RoofMatl = factor(all_data$RoofMatl)
all_data$ExterQual = factor(all_data$ExterQual)
all_data$ExterCond = factor(all_data$ExterCond)
all_data$MSSubClass = factor(all_data$MSSubClass)
all_data$Foundation = factor(all_data$Foundation)
all_data$Heating = factor(all_data$Heating)
all_data$HeatingQC = factor(all_data$HeatingQC)
all_data$CentralAir = factor(all_data$CentralAir)
all_data$PavedDrive = factor(all_data$PavedDrive)
all_data$MoSold = factor(all_data$MoSold)
all_data$YrSold = factor(all_data$YrSold)
all_data$SaleCondition = factor(all_data$SaleCondition)
all_data$BsmtFullBath = factor(all_data$BsmtFullBath)
all_data$BsmtHalfBath = factor(all_data$BsmtHalfBath)
all_data$FullBath = factor(all_data$FullBath)
all_data$HalfBath = factor(all_data$HalfBath)
```

```{r}
head(all_data, 2)
#Determining skew of each numeric variable

Column_classes <- sapply(names(all_data),function(x){class(all_data[[x]])})
numeric_columns <-names(Column_classes[Column_classes != "factor" & Column_classes != "character"])

skew <- sapply(numeric_columns,function(x){skewness(all_data[[x]],na.rm = T)})
# Let us determine a threshold skewness and transform all variables above the treshold.
skew <- skew[skew > 0.75]
# transform excessively skewed features with log(x + 1)
for(x in names(skew))
{
  all_data[[x]] <- log(all_data[[x]] + 1)
}
ggplot(all_data, mapping = aes(x = SalePrice)) + geom_histogram(color = "black", fill = "blue")

```

## Feature Engineering
Here it can be suggested that total basement surface area, 1st floor surface area and 2nd floor surface area can be equated to total surface area. Based on this TotalSF is created.
```{r}
#linear relationship b/n ground and upper floors
all_data$TotalSF = all_data$TotalBsmtSF + all_data$X1stFlrSF + all_data$X2ndFlrSF

#all_data <-subset(all_data, select = -c(log_SalePrice))
which(colSums(is.na(all_data))>0)
dim(all_data)
```

## Data Partition
```{r}
train_1 <- all_data[all_data$isTrain==1,]
test_1 <- all_data[all_data$isTrain==0,]
dim(train_1)
dim(test_1)
dim(all_data)
```


```{r}
train_1 = subset(train_1, select = -c(Id, isTrain))
test_1 = subset(test_1, select = -c(Id, isTrain, SalePrice))
set.seed(100)
partition <- createDataPartition(train_1$SalePrice, p = 0.80, list = FALSE)
train.m <- train_1[partition, ]
test.m <- train_1[-partition, ]

head(train.m, 3)
```


# Modeling
## RandomForest Model
```{r}
###FULL MODEL ###
model1 = randomForest(SalePrice ~ ., data = train.m)
summary(model1)
```
## RF Model Plot
```{r}
plot(model1)
```
## Variable Importance
```{r}
imp = importance(model1)
imp
```

### Plot of Variable Importance
```{r}
varImpPlot(model1)
```
```{r}
# RMSE of full model using log of SalePrice
RMSE(test.m$SalePrice, predict(model1, newdata = test.m))
```



## Prediction with the full model

```{r}
pred = predict(model1, newdata = test.m)
df = data.frame(Pred = exp(pred), Test = exp(test.m$SalePrice))

```

### Prediction VS Actual

```{r}
head(df, 5)
```
### RMSE of the full model using SalePrice 
```{r}
sqrt( mean( (exp(pred)-exp(test.m$SalePrice)) ^2) )
```
## Reduced Model
```{r}
train.rf = train.m
model.select.rf = randomForest(SalePrice ~ TotalSF + MSZoning + LotArea + LotShape + Neighborhood + Condition1 +Condition2+
                                 BldgType +  OverallQual +OverallCond + YearBuilt + RoofMatl + Exterior1st + ExterCond+
                                 Foundation +BsmtQual+ KitchenQual+FullBath+BsmtExposure+Fireplaces+ SaleCondition+
                                 CentralAir+PavedDrive,data = train.rf)  
summary(model.select.rf)


###model plot
plot(model.select.rf)

```
### variable of importance Reduced Model
```{r}
imp = importance(model.select.rf)
imp
```
### Plot of Variable of Importance
```{r}
varImpPlot(model.select.rf)
```
### RMSE of the Reduced model using log of SalePrice
```{r}
RMSE(test.m$SalePrice, predict(model.select.rf, newdata = test.m))
```
### Prediction vs Actual 
```{r}
#####Prediction with the reduced model###
pred = predict(model.select.rf, newdata = test.m)

########
df = data.frame(Pred = exp(pred), Test = exp(test.m$SalePrice))
head(df, 5)
```


### RMSE of the Reduced model using SalePrice
```{r}
pred = predict(model.select.rf, newdata = test.m)


sqrt( mean( (exp(pred)-exp(test.m$SalePrice)) ^2) )
```
Given the RMSE of the full model and the Reduce model,it can be suggested that the full model returns  a lower RMSE when compared to the reduced model, hence  will cost less to deploy.