1 Introduction

This is my entry for the Kaggle competition House Prices: Advanced Regression Techniques. The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It has 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa. In this competition, I predict the final price of each home.

2 Loading and Exploring Data

2.1 Loading Libraries and reading data

library(data.table) #read a dataset
library(dplyr) #manipulate data
library(ggplot2) #draw plots
library(forcats) #reorder factor levels (fct_reorder)
library(gridExtra) #arrange several plots (grid.arrange)
library(visdat) #visualise missing data
library(corrplot) #build a correlation matrix plot
library(Metrics) #evaluate models
library(caret) #work with missing data and build a model
library(glmnet) #build ridge and lasso regression
library(xgboost) #build a model
library(gbm) #build a gradient boosting model
library(randomForest) #build a random forest model
train<-fread("train.csv")
test<-fread("test.csv")
price<-fread("sample_submission.csv")

2.2 Data size and structure

Our dataset has 81 variables. Training data has 1460 rows and test data has 1459 rows.

dim(train)
## [1] 1460   81
dim(test) #without price 
## [1] 1459   80

To work with missing data, I combine the train and test data and remove the Id column.

test_price<-cbind(test, SalePrice=price$SalePrice) #combine test data with price
data<-rbind(train[,-1], test_price[,-1]) #combine train and test data without Id

Here we can see the structure of our data.

data<-as.data.frame(data)
str(data[,c(1:10, 80)])
## 'data.frame':    2919 obs. of  11 variables:
##  $ MSSubClass : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning   : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage: int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea    : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street     : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley      : chr  NA NA NA NA ...
##  $ LotShape   : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour: chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities  : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig  : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ SalePrice  : num  208500 181500 223500 140000 250000 ...

3 Work with missing data

3.1 Visualising missing data

Using library(visdat), I visualise the missing data:

vis_miss(data)

table(is.na(data))
## 
##  FALSE   TRUE 
## 219555  13965

About 6% of the values in the dataset are missing (13965 of 233520 cells).
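The share of missing cells can also be computed directly instead of being read off the table. A minimal sketch in base R, where `toy` is a made-up stand-in for `data` and its values are hypothetical:

```r
# Toy stand-in for `data` (hypothetical values)
toy <- data.frame(
  LotFrontage = c(65, NA, 68, NA),
  Alley       = c(NA, NA, "Grvl", NA),
  LotArea     = c(8450, 9600, 11250, 9550)
)

# Share of missing cells in the whole data frame
na_share <- mean(is.na(toy))

# Number of missing values per column, largest first
na_by_col <- sort(colSums(is.na(toy)), decreasing = TRUE)

print(round(100 * na_share, 1))
print(na_by_col)
```

Applied to the real data, `mean(is.na(data))` should reproduce the ratio 13965 / (219555 + 13965) from the table above.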

3.2 Imputing missing data

To impute missing values in categorical variables, I use mode:

#Function to find the mode (the first one if there are ties)
stat_mode <- function(x){
  t1<-table(x)
  result<-names(t1)[which.max(t1)]
  return (result)
}

MSZoning: Identifies the general zoning classification of the sale

   A    Agriculture
   C    Commercial
   FV   Floating Village Residential
   I    Industrial
   RH   Residential High Density
   RL   Residential Low Density
   RP   Residential Low Density Park 
   RM   Residential Medium Density

The mode of MSZoning is RL, so I impute this value.

#MSZoning NA->mode RL
data$MSZoning[is.na(data$MSZoning)]<-stat_mode(data$MSZoning)

LotFrontage: Linear feet of street connected to property. I assume that NAs mean 0.

#LotFrontage NA->0
data$LotFrontage[is.na(data$LotFrontage)]<-0

Alley: Type of alley access to property

   Grvl Gravel
   Pave Paved
   NA   No alley access

I switch NAs to “None”.

#Alley NA->None
data$Alley[is.na(data$Alley)]<-"None"

For Utilities, Exterior1st, Exterior2nd, Electrical, KitchenQual and Functional, I replace NAs with the mode.
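The code for these columns is not shown, so here is a minimal sketch of how mode imputation over several columns might look. The helper `mode_value` and the data frame `toy` are hypothetical names with made-up values, not the code actually used in this analysis:

```r
# Most frequent non-missing value (the first one on ties)
mode_value <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Toy stand-in for `data` (hypothetical values)
toy <- data.frame(
  Utilities   = c("AllPub", NA, "AllPub", "NoSeWa"),
  KitchenQual = c("TA", "Gd", NA, "TA"),
  stringsAsFactors = FALSE
)

# Replace NAs in each listed column with that column's mode
for (col in c("Utilities", "KitchenQual")) {
  toy[[col]][is.na(toy[[col]])] <- mode_value(toy[[col]])
}
print(toy)
```

On the real data, the loop would run over the six columns listed above.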

The number of NAs in MasVnrArea is equal to the number in MasVnrType. I think the MasVnrType NAs are None.

MasVnrType: Masonry veneer type

   BrkCmn   Brick Common
   BrkFace  Brick Face
   CBlock   Cinder Block
   None None
   Stone    Stone

MasVnrArea: Masonry veneer area in square feet. If MasVnrType is None, then MasVnrArea is equal to 0 square feet.

#NA MasVnrType->None
data$MasVnrType[is.na(data$MasVnrType)]<-"None"
#If MasVnrType->None, MasVnrArea==0
data$MasVnrArea<-ifelse(data$MasVnrType=="None", 0, data$MasVnrArea)

Many missing values are in the variables that describe the basement: height, condition, walkout or garden level walls, rating of the basement finished area, finished square feet, unfinished square feet and total square feet of basement area. The number of missing values in BsmtQual is equal to that in BsmtCond. I will not show all the manipulations with the basement's missing values here. The main idea is that if BsmtQual is None, then all basement parameters are None or 0.
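As an illustration only, the rule "no basement means None or 0" could be sketched as follows. This is an assumption about how such a step might look, not the omitted code; `toy` and its values are hypothetical:

```r
# Toy stand-in for `data` (hypothetical values)
toy <- data.frame(
  BsmtQual    = c("Gd", NA, "TA"),
  BsmtCond    = c("TA", NA, "TA"),
  TotalBsmtSF = c(856, NA, 920),
  stringsAsFactors = FALSE
)

# A missing BsmtQual is read as "no basement"
toy$BsmtQual[is.na(toy$BsmtQual)] <- "None"
no_bsmt <- toy$BsmtQual == "None"

# Categorical basement columns become "None", numeric ones become 0
toy$BsmtCond[no_bsmt]    <- "None"
toy$TotalBsmtSF[no_bsmt] <- 0
print(toy)
```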

If a basement bath count is NA, I set it to 0. Also, in all houses without fireplaces, FireplaceQu is None.

data$BsmtFullBath[is.na(data$BsmtFullBath)]<-0
data$BsmtHalfBath[is.na(data$BsmtHalfBath)]<-0
data$FireplaceQu<-ifelse(data$Fireplaces==0, "None",data$FireplaceQu )

There is an error in the variables that describe the garage: in some houses YearBuilt>GarageYrBlt, i.e. the garage is dated earlier than the house. I set GarageYrBlt to YearBuilt in these cases.

garage<-data %>% 
  filter(YearBuilt>GarageYrBlt) %>% 
  select(GarageType,YearBuilt, GarageYrBlt)
garage
##    GarageType YearBuilt GarageYrBlt
## 1      Detchd      1927        1920
## 2      Detchd      1910        1900
## 3     BuiltIn      1967        1961
## 4     BuiltIn      2005        2003
## 5      Detchd      1950        1949
## 6     BuiltIn      1959        1954
## 7      Detchd      1930        1925
## 8      Detchd      1923        1922
## 9      Detchd      1963        1962
## 10     Attchd      1959        1956
## 11     Attchd      2010        2009
## 12     Detchd      1935        1920
## 13     Detchd      1978        1960
## 14     Detchd      1941        1940
## 15     Detchd      1935        1926
## 16     Attchd      1945        1925
## 17     Attchd      2006        2005
## 18     Attchd      2006        2005
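The replacement itself is not shown, so here is a minimal sketch of one way to do it in base R. The data frame `toy` is a hypothetical stand-in for `data`; its first and third rows reuse values from the table above:

```r
# Toy stand-in for `data`
toy <- data.frame(
  YearBuilt   = c(1927, 2003, 1967),
  GarageYrBlt = c(1920, 2003, 1961)
)

# Rows where the garage is dated before the house (NA-safe)
bad <- !is.na(toy$GarageYrBlt) & toy$YearBuilt > toy$GarageYrBlt

# Use the year the house was built for those rows
toy$GarageYrBlt[bad] <- toy$YearBuilt[bad]
print(toy)
```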

The garage parameters are handled by the same principle as the basement ones.

3.3 Changing the type of variables

I convert the character variables, MoSold and MSSubClass to factors.

data$MoSold<-as.factor(data$MoSold)
data$MSSubClass<-as.factor(data$MSSubClass)
data$GarageCars<-as.integer(data$GarageCars)
#switch character to factor
data<-data %>% 
  mutate_if(is.character, as.factor)
str(data)
## 'data.frame':    2919 obs. of  80 variables:
##  $ MSSubClass   : Factor w/ 16 levels "20","30","40",..: 6 1 6 7 6 5 1 6 5 16 ...
##  $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
##  $ LotFrontage  : num  65 80 68 60 84 85 75 0 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley        : Factor w/ 3 levels "Grvl","None",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
##  $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
##  $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
##  $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
##  $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
##  $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
##  $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
##  $ MasVnrArea   : num  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
##  $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
##  $ BsmtQual     : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 3 3 5 3 3 1 3 5 5 ...
##  $ BsmtCond     : Factor w/ 5 levels "Fa","Gd","None",..: 5 5 5 2 5 5 5 5 5 5 ...
##  $ BsmtExposure : Factor w/ 5 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
##  $ BsmtFinType1 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 7 3 ...
##  $ BsmtFinSF1   : num  706 978 486 216 655 ...
##  $ BsmtFinType2 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 7 7 7 7 7 7 7 2 7 7 ...
##  $ BsmtFinSF2   : num  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : num  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : num  856 1262 920 756 1145 ...
##  $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
##  $ 1stFlrSF     : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ 2ndFlrSF     : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : num  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : Factor w/ 6 levels "Ex","Fa","Gd",..: 4 6 6 3 6 4 3 6 6 6 ...
##  $ GarageType   : Factor w/ 7 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : Factor w/ 4 levels "Fin","None","RFn",..: 3 3 3 4 3 4 3 3 4 3 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : num  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : Factor w/ 6 levels "Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 2 3 ...
##  $ GarageCond   : Factor w/ 6 levels "Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ 3SsnPorch    : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Fence        : Factor w/ 5 levels "GdPrv","GdWo",..: 5 5 5 5 5 3 5 5 5 5 ...
##  $ MiscFeature  : Factor w/ 5 levels "Gar2","None",..: 2 2 2 2 2 4 2 4 2 2 ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : Factor w/ 12 levels "1","2","3","4",..: 2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
##  $ SalePrice    : num  208500 181500 223500 140000 250000 ...

Quality and condition have the same gradation:

   Ex   Excellent
   Gd   Good
   TA   Typical/Average
   Fa   Fair
   Po   Poor
   None None 

So I will use integer variables:

   5 Excellent
   4 Good
   3 Typical/Average
   2 Fair
   1 Poor
   0 None
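A minimal sketch of this recoding with a named lookup vector. The names `quality_levels` and `garage_qual` are hypothetical and the input values are made up:

```r
# Lookup table from quality code to integer score
quality_levels <- c(None = 0, Po = 1, Fa = 2, TA = 3, Gd = 4, Ex = 5)

# Example quality column (hypothetical values)
garage_qual <- c("TA", "Gd", "None", "Ex", "Fa")

# as.character() makes the lookup safe if the column is a factor
garage_qual_int <- as.integer(quality_levels[as.character(garage_qual)])
print(garage_qual_int)
```

The same lookup could be applied to every column that uses this scale, for example the garage, fireplace and pool quality variables.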

4 Creating new variables

4.1 Total Bathroom

BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade

A half-bath, also known as a powder room or guest bath, has only two of the four main bathroom components, typically a toilet and a sink. I create a new variable, Totbath, equal to the sum of the bath variables, with each half bath counted as 0.5.

data$Totbath<-data$BsmtFullBath+data$BsmtHalfBath*0.5+data$FullBath+data$HalfBath*0.5

4.2 House Age and is.Remod

Age is the year sold minus the remodel date (YearRemodAdd). To record whether a house has been remodeled, I add the variable Remod. The variable is.new shows whether a house is new (sold in the year it was built) or old.

data$Age<-data$YrSold-data$YearRemodAdd
data$Remod<-ifelse(data$YearBuilt==data$YearRemodAdd, 0,1) #0-No, 1-Remod
data$is.new<-ifelse(data$YearBuilt==data$YrSold, 1,0) #1-new, 0-old

4.3 Total Porch

There are 4 variables that contain the area in square feet of different types of porches. I combine this information into one variable, TotPorsh.

data$TotPorsh<-data$OpenPorchSF+data$EnclosedPorch+data$`3SsnPorch`+data$ScreenPorch

5 Correlation

5.1 High correlation with price

I create a matrix of the variables whose absolute correlation with price is above 0.5.

train<-as.data.frame(data[1:1460,])
test<-as.data.frame(data[1461:nrow(data),])
train_num<-train[sapply(train,is.numeric)]

corr_matrix<-cor(train_num)
cor_sorted <- as.matrix(sort(corr_matrix[,'SalePrice'], decreasing = TRUE))
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
corr_matrix <- corr_matrix[CorHigh, CorHigh]

corrplot.mixed(corr_matrix, tl.col="black", tl.pos = "lt")

As we see, some variables are highly correlated with each other, for example GarageCars and GarageArea, or GarageCond and GarageQual.

train_num_no_price<-train_num %>% 
  select(-SalePrice)
corr_matrix_n<-cor(train_num_no_price)

top.mat <- function(X, level=0.45, N=12, values=TRUE) {
  X.nam <- row.names(X)
  X.tri <- as.vector(lower.tri(X))
  X.rep.g <- rep(X.nam, length(X.nam))[X.tri]
  X.rep.e <- rep(X.nam, each=length(X.nam))[X.tri]
  X.vec <- as.vector(X)[X.tri]
  X.df <- data.frame(Var1=X.rep.g, Var2=X.rep.e, Value=X.vec)
  if (values) {
    #keep pairs with |correlation| >= level, strongest first
    X.df <- X.df[abs(X.df$Value) >= level, ]
    X.df <- X.df[order(-abs(X.df$Value)), ]
  } else {
    #keep the N strongest pairs
    X.df <- X.df[order(-abs(X.df$Value)), ]
    X.df <- X.df[1:N, ]
  }
  row.names(X.df) <- seq_len(nrow(X.df))
  return(X.df)
}
highcor<-top.mat(corr_matrix_n)
highcor %>% 
  filter(Value>0.75)
##           Var1        Var2     Value
## 1   GarageCond  GarageQual 0.9591716
## 2       PoolQC    PoolArea 0.9370565
## 3   GarageArea  GarageCars 0.8824754
## 4  FireplaceQu  Fireplaces 0.8632412
## 5  GarageYrBlt   YearBuilt 0.8453790
## 6 TotRmsAbvGrd   GrLivArea 0.8254894
## 7     1stFlrSF TotalBsmtSF 0.8195300
highCor = findCorrelation(corr_matrix_n, cutoff = 0.75)
print("Names of these variables:")
## [1] "Names of these variables:"
names(train_num_no_price)[highCor] 
## [1] "GrLivArea"   "GarageCars"  "YearBuilt"   "TotalBsmtSF" "Totbath"    
## [6] "FireplaceQu" "GarageQual"  "PoolArea"

To decide which variable of each highly correlated pair to remove, I look at the correlation of each with SalePrice and remove the one with the smaller coefficient.

cor(train$SalePrice, train$GarageCond)
## [1] 0.2631908
cor(train$SalePrice, train$GarageQual)
## [1] 0.2738391
cor(train$SalePrice, train$GarageArea)
## [1] 0.6234314
cor(train$SalePrice, train$GarageCars)
## [1] 0.6404092
cor(train$SalePrice, train$TotRmsAbvGrd)
## [1] 0.5337232
cor(train$SalePrice, train$GrLivArea)
## [1] 0.7086245
cor(train$SalePrice, train$`1stFlrSF`)
## [1] 0.6058522
cor(train$SalePrice, train$TotalBsmtSF)
## [1] 0.6135806
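The comparison can be written as a small rule. The sketch below reuses the coefficients printed above; the names `cor_price` and `cor_pairs` are hypothetical:

```r
# Correlation of each variable with SalePrice (values from the output above)
cor_price <- c(GarageCond = 0.2631908, GarageQual = 0.2738391,
               GarageArea = 0.6234314, GarageCars = 0.6404092,
               TotRmsAbvGrd = 0.5337232, GrLivArea = 0.7086245,
               `1stFlrSF` = 0.6058522, TotalBsmtSF = 0.6135806)

# Highly correlated pairs of explanatory variables
cor_pairs <- data.frame(
  Var1 = c("GarageCond", "GarageArea", "TotRmsAbvGrd", "1stFlrSF"),
  Var2 = c("GarageQual", "GarageCars", "GrLivArea", "TotalBsmtSF"),
  stringsAsFactors = FALSE
)

# From each pair, drop the member less correlated with SalePrice
c1 <- abs(unname(cor_price[cor_pairs$Var1]))
c2 <- abs(unname(cor_price[cor_pairs$Var2]))
cor_pairs$drop <- ifelse(c1 < c2, cor_pairs$Var1, cor_pairs$Var2)
print(cor_pairs)
```

With these coefficients the rule drops GarageCond, GarageArea, TotRmsAbvGrd and 1stFlrSF.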

5.2 Linearly dependent variables

I also search for linearly dependent variables.

linCombo <- findLinearCombos(corr_matrix_n)
names(train_num_no_price)[linCombo$remove]
## [1] "TotalBsmtSF" "GrLivArea"   "Totbath"     "Age"         "TotPorsh"
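To illustrate what linear dependence means here, the sketch below builds a toy matrix in which TotalBsmtSF is the exact sum of the three other basement columns and checks its rank with base R's `qr()`. The values are hypothetical, and this only shows the underlying idea; it is not the output of `findLinearCombos`:

```r
# Toy basement areas (hypothetical values)
X <- cbind(
  BsmtFinSF1 = c(706, 978, 486, 216, 655),
  BsmtFinSF2 = c(0, 40, 0, 32, 0),
  BsmtUnfSF  = c(150, 284, 434, 540, 490)
)

# The total is an exact linear combination of the other columns
X <- cbind(X, TotalBsmtSF = rowSums(X))

# Rank below the number of columns means linearly dependent columns
rank_X <- qr(X)$rank
print(c(rank = rank_X, columns = ncol(X)))
```

A rank of 3 for 4 columns shows that one column carries no new information, which is why either the total or its parts can be dropped.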

5.3 Removing variables

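I remove the variables that are components of TotalBsmtSF, GrLivArea, Totbath, Age and TotPorsh (all except YrBlt), as well as GarageCond, GarageArea and TotRmsAbvGrd, which are highly correlated with variables I keep. Then I reorder the variables in the data frame.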

remove_name<-which(colnames(data) %in% c("YearRemodAdd", 'BsmtFinSF1','BsmtFinSF2','BsmtUnfSF',
                             'BsmtFullBath','BsmtHalfBath', 'HalfBath', 'FullBath',
                             'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'GarageCond', 'GarageArea',
                             'TotRmsAbvGrd','1stFlrSF','2ndFlrSF'))
#new data (without removed variables)
data<-data[,-remove_name]

#reordered data
data<-data %>% 
  select(MSSubClass:LowQualFinSF,Totbath:is.new, GrLivArea:GarageQual, TotPorsh, PavedDrive:SalePrice)
train<-data[1:1460,]
test<-data[1461:nrow(data),]
train_num<-train[sapply(train,is.numeric)]

corr_matrix<-cor(train_num)
cor_sorted <- as.matrix(sort(corr_matrix[,'SalePrice'], decreasing = TRUE))
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
corr_matrix <- corr_matrix[CorHigh, CorHigh]

corrplot.mixed(corr_matrix, tl.col="black", tl.pos = "lt")

6 Bar plots for factor variables

6.1 Function for bar plots

## [1] 35

There are 35 factor variables in the dataset. I split all factor variables into 11 groups. Each group describes some parameters of the property. The first group is type and style of dwelling. I created a function to show the relationship between a factor variable and price.

factors <- sapply(train, function(x) is.factor(x))
factors_only<- train[,factors]
bar_price <- function(data, var,color=mycol){
  ggplot(data,aes(fct_reorder(!!sym(var),SalePrice, .desc = TRUE), 
                  SalePrice, fill = !!sym(var))) +
    stat_summary(aes(y = SalePrice), fun = "median", geom = "bar")+
    geom_hline(yintercept = median(data$SalePrice), color="red")+
    scale_fill_manual(values = rep(color,  25))+
    geom_label(stat = "count", aes(label = ..count.., y = ..count..), 
               fill="white")+
    ylab("SalePrice")+
    xlab(var)+
    theme_bw()+
    theme(legend.position = "none")
}
temp1 <- lapply(names(factors_only), bar_price, data = train, "#202040")
temp2<-lapply(names(factors_only), bar_price, data = train, "#4ea0ae")
temp3<-lapply(names(factors_only), bar_price, data = train, "#158467")
temp4<-lapply(names(factors_only), bar_price, data = train, "#ffd571")

Every bar plot shows the median price in each category of a factor variable. The red line is the median price among all houses. The label at the bottom of each bar is the number of houses in the category.
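The numbers behind one such plot can be sketched in base R: a median per category, a count per category, and the overall median that the red line marks. The data frame `toy` and its values are hypothetical:

```r
# Toy stand-in for `train` (hypothetical values)
toy <- data.frame(
  MSZoning  = c("RL", "RL", "RM", "RL", "RM", "FV"),
  SalePrice = c(208500, 181500, 140000, 250000, 129900, 260000)
)

# Bar heights: median price per category
med_by_cat <- tapply(toy$SalePrice, toy$MSZoning, median)

# Bar labels: number of houses per category
n_by_cat <- table(toy$MSZoning)

# Red line: median price among all houses
overall_median <- median(toy$SalePrice)

summary_tbl <- data.frame(
  category = names(med_by_cat),
  median   = as.numeric(med_by_cat),
  n        = as.integer(n_by_cat[names(med_by_cat)])
)

# Same ordering as the plot: highest median first
summary_tbl <- summary_tbl[order(-summary_tbl$median), ]
print(summary_tbl)
print(overall_median)
```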

6.2 Type and style

The first group is type and style of dwelling.

MSSubClass: Identifies the type of dwelling involved in the sale.

    20  1-STORY 1946 & NEWER ALL STYLES
    30  1-STORY 1945 & OLDER
    40  1-STORY W/FINISHED ATTIC ALL AGES
    45  1-1/2 STORY - UNFINISHED ALL AGES
    50  1-1/2 STORY FINISHED ALL AGES
    60  2-STORY 1946 & NEWER
    70  2-STORY 1945 & OLDER
    75  2-1/2 STORY ALL AGES
    80  SPLIT OR MULTI-LEVEL
    85  SPLIT FOYER
    90  DUPLEX - ALL STYLES AND AGES
   120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
   150  1-1/2 STORY PUD - ALL AGES
   160  2-STORY PUD - 1946 & NEWER
   180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
   190  2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.

   A    Agriculture
   C    Commercial
   FV   Floating Village Residential
   I    Industrial
   RH   Residential High Density
   RL   Residential Low Density
   RP   Residential Low Density Park 
   RM   Residential Medium Density

class_zone<-grid.arrange(temp1[[1]],temp1[[2]])

my_summarise <- function(data=train, group_var) {
  data %>%
    group_by({{ group_var }}) %>%
    dplyr::summarise(mean = mean(SalePrice), n=n()) %>% 
    arrange(desc(n))
}
MSSubClass<-train %>% 
  my_summarise(MSSubClass)
  
MSZoning<-train %>% 
  my_summarise(MSZoning) 

Insights:

  • The 3 most popular types of dwelling are:

    • 20 1-STORY 1946 & NEWER ALL STYLES

    • 60 2-STORY 1946 & NEWER

    • 50 1-1/2 STORY FINISHED ALL AGES

  • The 3 most expensive categories are:

    • 60 2-STORY 1946 & NEWER

    • 120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER

    • 80 SPLIT OR MULTI-LEVEL

  • The most popular zoning is RL Residential Low Density, with a mean price of $191005.

6.3 Lot

In this group I include all the information about the lot and the area:

Street: Type of road access to property

   Grvl Gravel  
   Pave Paved

Alley: Type of alley access to property

   Grvl Gravel
   Pave Paved
   None     No alley access
    

LotShape: General shape of property

   Reg  Regular 
   IR1  Slightly irregular
   IR2  Moderately Irregular
   IR3  Irregular

LandContour: Flatness of the property

   Lvl  Near Flat/Level 
   Bnk  Banked - Quick and significant rise from street grade to building
   HLS  Hillside - Significant slope from side to side
   Low  Depression

LotConfig: Lot configuration

   Inside   Inside lot
   Corner   Corner lot
   CulDSac  Cul-de-sac
   FR2  Frontage on 2 sides of property
   FR3  Frontage on 3 sides of property

LandSlope: Slope of property

   Gtl  Gentle slope
   Mod  Moderate Slope  
   Sev  Severe Slope

land<-grid.arrange(temp2[[3]],temp2[[4]],temp2[[5]],temp2[[6]], temp2[[8]], temp2[[9]])

Insights:

  • With the exception of 6 houses, road access to the property is paved.

  • Most houses have no alley access.

  • The lot shape is mostly regular, but 30% of all properties have a slightly irregular shape.

    • Properties with an irregular shape have a higher median price.

  • The flatness of the property is mostly Lvl Near Flat/Level. Its median price is equal to the median price among all houses.

  • 72% of houses have an inside lot configuration.

    • Houses located on a cul-de-sac or with frontage on 3 sides of the property have a higher price.

  • 95% of houses have a gentle slope.

6.4 Type and style of dwelling

This group includes type and style of dwelling:

BldgType: Type of dwelling

   1Fam Single-family Detached  
   2FmCon   Two-family Conversion; originally built as one-family dwelling
   Duplx    Duplex
   TwnhsE   Townhouse End Unit
   TwnhsI   Townhouse Inside Unit

HouseStyle: Style of dwelling

   1Story   One story
   1.5Fin   One and one-half story: 2nd level finished
   1.5Unf   One and one-half story: 2nd level unfinished
   2Story   Two story
   2.5Fin   Two and one-half story: 2nd level finished
   2.5Unf   Two and one-half story: 2nd level unfinished
   SFoyer   Split Foyer
   SLvl Split Level
   
type_style<-grid.arrange(temp4[[13]],temp4[[14]])

Insights:

  • The most popular building type is 1Fam Single-family Detached. Its median price is a little higher than the median price among all houses.

  • One story houses are the most popular. Two story houses are second in popularity; unlike one story houses, their price is higher than the median price among all houses.

6.5 Roof and exterior

This group brings together the type of roof, the roof material and the masonry veneer type.

RoofStyle: Type of roof

   Flat Flat
   Gable    Gable
   Gambrel  Gambrel (Barn)
   Hip  Hip
   Mansard  Mansard
   Shed Shed
    

RoofMatl: Roof material

   ClyTile  Clay or Tile
   CompShg  Standard (Composite) Shingle
   Membran  Membrane
   Metal    Metal
   Roll Roll
   Tar&Grv  Gravel & Tar
   WdShake  Wood Shakes
   WdShngl  Wood Shingles

MasVnrType: Masonry veneer type

   BrkCmn   Brick Common
   BrkFace  Brick Face
   CBlock   Cinder Block
   None None
   Stone    Stone
   
roof_exterior<-grid.arrange(temp1[[15]],temp1[[16]], temp1[[19]])

Insights:

  • 78% of houses have a gable roof. Their price is nearly equal to the median price among all houses.

  • 20% of houses have a hip roof. Their price is a little higher than the median price among all houses.

  • Almost all houses have CompShg standard (Composite) shingle as roof material.

  • 60% of houses do not have masonry veneer.

  • Houses with stone and brick face masonry veneer are more expensive.

6.6 Basement

This group describes the basement and foundation parameters:

Foundation: Type of foundation

   BrkTil   Brick & Tile
   CBlock   Cinder Block
   PConc    Poured Concrete 
   Slab Slab
   Stone    Stone
   Wood Wood

BsmtExposure: Refers to walkout or garden level walls

   Gd   Good Exposure
   Av   Average Exposure (split levels or foyers typically score average or above)  
   Mn   Minimum Exposure
   No   No Exposure
   NA   No Basement
   

BsmtFinType1 and BsmtFinType2 (if multiple types): Rating of basement finished area

   GLQ  Good Living Quarters
   ALQ  Average Living Quarters
   BLQ  Below Average Living Quarters   
   Rec  Average Rec Room
   LwQ  Low Quality
   Unf  Unfinished
   NA   No Basement
   
bsmt<-grid.arrange(temp2[[20]],temp2[[21]],temp2[[22]],temp2[[23]])

Insights:

  • 44% of houses have a PConc poured concrete foundation and 43% of houses have a CBlock cinder block foundation. Houses with a poured concrete foundation are more expensive than those with a cinder block foundation.

  • The higher the rating of basement finished area, the higher the price.

6.7 Utilities

This group shows which utilities the houses have and how the price changes with them.

Utilities: Type of utilities available

   AllPub   All public Utilities (E,G,W,& S)    
   NoSewr   Electricity, Gas, and Water (Septic Tank)
   NoSeWa   Electricity and Gas Only
   ELO  Electricity only
   

Heating: Type of heating

   Floor    Floor Furnace
   GasA Gas forced warm air furnace
   GasW Gas hot water or steam heat
   Grav Gravity furnace 
   OthW Hot water or steam heat other than gas
   Wall Wall furnace
   

CentralAir: Central air conditioning

   N    No
   Y    Yes

Electrical: Electrical system

   SBrkr    Standard Circuit Breakers & Romex
   FuseA    Fuse Box over 60 AMP and all Romex wiring (Average) 
   FuseF    60 AMP Fuse Box and mostly Romex wiring (Fair)
   FuseP    60 AMP Fuse Box and mostly knob & tube wiring (poor)
   Mix  Mixed

Functional: Home functionality (Assume typical unless deductions are warranted)

   Typ  Typical Functionality
   Min1 Minor Deductions 1
   Min2 Minor Deductions 2
   Mod  Moderate Deductions
   Maj1 Major Deductions 1
   Maj2 Major Deductions 2
   Sev  Severely Damaged
   Sal  Salvage only
   
util<-grid.arrange(temp3[[7]],temp3[[24]],temp3[[25]], temp3[[26]],  temp3[[27]])

Insights:

  • Except for one house, all houses have AllPub all public utilities (E,G,W,& S).

  • Most houses have GasA gas forced warm air furnace heating.

  • Most houses have central air conditioning.

  • Most homes are equipped with SBrkr standard circuit breakers & Romex.

  • Most houses have typical functionality, i.e. no deductions are warranted.

6.8 Garage

Here we are talking about the garage:

GarageType: Garage location

   2Types   More than one type of garage
   Attchd   Attached to home
   Basment  Basement Garage
   BuiltIn  Built-In (Garage part of house - typically has room above garage)
   CarPort  Car Port
   Detchd   Detached from home
   NA   No Garage

GarageFinish: Interior finish of the garage

   Fin  Finished
   RFn  Rough Finished  
   Unf  Unfinished
   NA   No Garage

PavedDrive: Paved driveway

   Y    Paved 
   P    Partial Pavement
   N    Dirt/Gravel
   
garage<-grid.arrange(temp4[[28]],temp4[[29]], temp4[[30]])

Insights:

  • 60% of houses have a garage attached to the home, and 27% have a detached one.

    • Houses with a Built-In garage are the most expensive. Attached garages are in second place.

  • 41% of houses have an unfinished garage; their price is lower than the median price among all houses.

    • The median price of houses with a finished or rough finished garage is higher than the median price among all houses.

  • Most houses have a paved driveway.

6.9 Garage