Image source: https://www.forbes.com/sites/markgreene/2019/10/11/how-to-buy-a-house-with-10000/
This post works through the Kaggle competition House Prices: Advanced Regression Techniques. The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It has 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa. In this competition, I predict the final price of each home.
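library(data.table) #read a dataset
library(dplyr) #manipulate data
library(forcats) #reorder factor levels (fct_reorder)
library(ggplot2) #draw plots
library(gridExtra) #arrange several plots (grid.arrange)
library(visdat) #visualise missing data
library(corrplot) #build a correlation matrix plot
library(Metrics) #evaluate models
library(caret) #work with missing data and build a model
library(glmnet) #build ridge and lasso regression
library(xgboost) #build a gradient boosting model
library(gbm) #build a gradient boosting model
library(randomForest) #build a random forest model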
train<-fread("train.csv")
test<-fread("test.csv")
price<-fread("sample_submission.csv")
The training data has 1460 rows and 81 variables. The test data has 1459 rows and 80 variables, because it does not include SalePrice.
dim(train)
## [1] 1460 81
dim(test) #without price
## [1] 1459 80
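To handle missing data in one pass, I combine the train and test data and remove Id. The test data has no SalePrice, so I attach the SalePrice column from sample_submission.csv to give both sets the same columns.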
test_price<-cbind(test, SalePrice=price$SalePrice) #combine test data with price
data<-rbind(train[,-1], test_price[,-1]) #combine train and test data without Id
Here is the structure of the data (the first ten variables and SalePrice).
data<-as.data.frame(data)
str(data[,c(1:10, 80)])
## 'data.frame': 2919 obs. of 11 variables:
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage: int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour: chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ SalePrice : num 208500 181500 223500 140000 250000 ...
Using vis_miss() from the visdat package, I visualise the missing data.
vis_miss(data)
table(is.na(data))
##
## FALSE TRUE
## 219555 13965
About 6% of the values in the dataset are missing.
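The 6% figure follows from the table above: 13965 / (219555 + 13965) ≈ 0.06. The same number, plus a per-column breakdown, can be computed directly. This is a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical, and with the real data `data` would take its place.

```r
# Toy data frame standing in for `data` (hypothetical values)
toy <- data.frame(
  LotFrontage = c(65, NA, 68, NA),
  Alley       = c(NA, NA, "Grvl", NA),
  LotArea     = c(8450, 9600, 11250, 9550)
)

# Overall share of missing cells: is.na() gives a logical matrix,
# and the mean of a logical is the proportion of TRUE values
overall_missing <- mean(is.na(toy))

# Share of missing values per column, largest first
missing_by_col <- sort(colMeans(is.na(toy)), decreasing = TRUE)

print(round(overall_missing, 3))
print(missing_by_col)
```

Sorting the per-column shares makes it easy to see which variables need an imputation rule first.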
To impute missing values in categorical variables, I use the mode:
#Function to find the mode (the first one if several values tie)
stat_mode <- function(x){
t1<-table(x)
result<-names(t1)[which.max(t1)]
return (result)
}
MSZoning: Identifies the general zoning classification of the sale
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
The mode of MSZoning is RL, so I impute this value.
#MSZoning NA->mode RL
data$MSZoning[is.na(data$MSZoning)]<-stat_mode(data$MSZoning)
LotFrontage: Linear feet of street connected to property. I assume that NAs mean 0.
#LotFrontage NA->0
data$LotFrontage[is.na(data$LotFrontage)]<-0
Alley: Type of alley access to property
Grvl Gravel
Pave Paved
NA No alley access
I switch NAs to “None”.
#Alley NA->None
data$Alley[is.na(data$Alley)]<-"None"
For Utilities, Exterior1st, Exterior2nd, Electrical, KitchenQual and Functional, I replace NAs with the mode.
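The mode imputation for several columns can be written as one loop. This is a minimal sketch, assuming a tie-safe mode helper and a made-up data frame; `toy` and `mode_cols` are hypothetical names, and on the real data the loop would run over the six columns listed above.

```r
# Mode helper: most frequent non-missing value (first one on ties)
stat_mode <- function(x) {
  t1 <- table(x)            # table() drops NAs by default
  names(t1)[which.max(t1)]
}

# Toy stand-in for `data` (hypothetical values)
toy <- data.frame(
  Utilities   = c("AllPub", "AllPub", NA, "NoSeWa"),
  KitchenQual = c("TA", NA, "TA", "Gd"),
  stringsAsFactors = FALSE
)

# Columns whose NAs are replaced by the column mode
mode_cols <- c("Utilities", "KitchenQual")
for (col in mode_cols) {
  missing <- is.na(toy[[col]])
  toy[[col]][missing] <- stat_mode(toy[[col]])
}
print(toy)
```

Looping over column names keeps the rule in one place, so adding a column means adding one string to `mode_cols`.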
The number of NAs in MasVnrArea is equal to the number in MasVnrType.
I assume that MasVnrType NAs mean None.
MasVnrType: Masonry veneer type
BrkCmn Brick Common
BrkFace Brick Face
CBlock Cinder Block
None None
Stone Stone
MasVnrArea: Masonry veneer area in square feet. If MasVnrType is None then MasVnrArea is equal to 0 square feet
#NA MasVnrType->None
data$MasVnrType[is.na(data$MasVnrType)]<-"None"
#If MasVnrType->None, MasVnrArea==0
data$MasVnrArea<-ifelse(data$MasVnrType=="None", 0, data$MasVnrArea)
There are many missing values in the variables that describe the basement: height, condition, walkout or garden level walls, rating of the basement finished area, finished square feet, unfinished square feet and total square feet of basement area. The number of missing values in BsmtQual is equal to the number in BsmtCond. I will not show all the manipulations with the basement's missing values here. The main idea is that if BsmtQual is None, then all basement parameters are None or 0.
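Since the basement code is not shown, here is a minimal sketch of the idea on a made-up data frame. It assumes a missing BsmtQual means "no basement"; `toy`, `no_bsmt` and the choice of columns are hypothetical and only illustrate the rule, not the exact steps used on the real data.

```r
# Toy stand-in for `data` (hypothetical values)
toy <- data.frame(
  BsmtQual    = c("Gd", NA, "TA"),
  BsmtCond    = c("TA", NA, NA),
  BsmtFinSF1  = c(706, NA, 120),
  TotalBsmtSF = c(856, NA, 756),
  stringsAsFactors = FALSE
)

# Assumption: a missing BsmtQual means the house has no basement
toy$BsmtQual[is.na(toy$BsmtQual)] <- "None"
no_bsmt <- toy$BsmtQual == "None"

# Houses without a basement: categorical fields -> "None", areas -> 0
toy$BsmtCond[no_bsmt] <- "None"
for (col in c("BsmtFinSF1", "TotalBsmtSF")) {
  toy[[col]][no_bsmt] <- 0
}
print(toy)
```

Row 3 keeps its NA in BsmtCond because that house does have a basement; such cases need a separate rule, for example the mode.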
If a basement bath count is NA, I set it to 0. Also, for all houses without fireplaces, FireplaceQu is None.
data$BsmtFullBath[is.na(data$BsmtFullBath)]<-0
data$BsmtHalfBath[is.na(data$BsmtHalfBath)]<-0
data$FireplaceQu<-ifelse(data$Fireplaces==0, "None",data$FireplaceQu )
There is an error in the variables that describe the garage. For some houses YearBuilt>GarageYrBlt, i.e. the garage is recorded as older than the house. I set GarageYrBlt to YearBuilt in these cases.
garage<-data %>%
filter(YearBuilt>GarageYrBlt) %>%
select(GarageType,YearBuilt, GarageYrBlt)
garage
## GarageType YearBuilt GarageYrBlt
## 1 Detchd 1927 1920
## 2 Detchd 1910 1900
## 3 BuiltIn 1967 1961
## 4 BuiltIn 2005 2003
## 5 Detchd 1950 1949
## 6 BuiltIn 1959 1954
## 7 Detchd 1930 1925
## 8 Detchd 1923 1922
## 9 Detchd 1963 1962
## 10 Attchd 1959 1956
## 11 Attchd 2010 2009
## 12 Detchd 1935 1920
## 13 Detchd 1978 1960
## 14 Detchd 1941 1940
## 15 Detchd 1935 1926
## 16 Attchd 1945 1925
## 17 Attchd 2006 2005
## 18 Attchd 2006 2005
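The replacement itself can be sketched as follows. This is a minimal example on made-up rows (`toy` and `bad` are hypothetical names); it assumes that rows with a missing GarageYrBlt should be left untouched.

```r
# Toy stand-in for `data` (hypothetical values)
toy <- data.frame(
  YearBuilt   = c(1927, 2005, 1990, 1950),
  GarageYrBlt = c(1920, 2003, 1995, NA)
)

# Rows where the garage is recorded as older than the house;
# which() drops comparisons that evaluate to NA
bad <- which(toy$YearBuilt > toy$GarageYrBlt)

# Set the garage year to the year the house was built
toy$GarageYrBlt[bad] <- toy$YearBuilt[bad]
print(toy)
```

Using which() rather than a raw logical index avoids errors when GarageYrBlt still contains NAs.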
The garage parameters are handled by the same principle as the basement.
I convert the character variables, MoSold and MSSubClass to factors.
data$MoSold<-as.factor(data$MoSold)
data$MSSubClass<-as.factor(data$MSSubClass)
data$GarageCars<-as.integer(data$GarageCars)
#switch character to factor
data<-data %>%
mutate_if(is.character, as.factor)
str(data)
## 'data.frame': 2919 obs. of 80 variables:
## $ MSSubClass : Factor w/ 16 levels "20","30","40",..: 6 1 6 7 6 5 1 6 5 16 ...
## $ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
## $ LotFrontage : num 65 80 68 60 84 85 75 0 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 3 levels "Grvl","None",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
## $ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
## $ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
## $ LandSlope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
## $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
## $ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
## $ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior1st : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
## $ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
## $ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
## $ MasVnrArea : num 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
## $ ExterCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
## $ BsmtQual : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 3 3 5 3 3 1 3 5 5 ...
## $ BsmtCond : Factor w/ 5 levels "Fa","Gd","None",..: 5 5 5 2 5 5 5 5 5 5 ...
## $ BsmtExposure : Factor w/ 5 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
## $ BsmtFinType1 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 7 3 ...
## $ BsmtFinSF1 : num 706 978 486 216 655 ...
## $ BsmtFinType2 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 7 7 7 7 7 7 7 2 7 7 ...
## $ BsmtFinSF2 : num 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : num 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : num 856 1262 920 756 1145 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ HeatingQC : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ CentralAir : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ Electrical : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
## $ 1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ 2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : num 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : num 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : Factor w/ 6 levels "Ex","Fa","Gd",..: 4 6 6 3 6 4 3 6 6 6 ...
## $ GarageType : Factor w/ 7 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : Factor w/ 4 levels "Fin","None","RFn",..: 3 3 3 4 3 4 3 3 4 3 ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : num 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : Factor w/ 6 levels "Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 2 3 ...
## $ GarageCond : Factor w/ 6 levels "Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ PavedDrive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ 3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Fence : Factor w/ 5 levels "GdPrv","GdWo",..: 5 5 5 5 5 3 5 5 5 5 ...
## $ MiscFeature : Factor w/ 5 levels "Gar2","None",..: 2 2 2 2 2 4 2 4 2 2 ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : Factor w/ 12 levels "1","2","3","4",..: 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
## $ SalePrice : num 208500 181500 223500 140000 250000 ...
Quality and condition have the same gradation:
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
None None
So I will use integer variables:
5 Excellent
4 Good
3 Typical/Average
2 Fair
1 Poor
0 None
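One way to apply this mapping is a named lookup vector. This is a minimal sketch on made-up values; `quality_scores`, `quality_to_int` and `toy` are hypothetical names, and the list of columns to convert on the real data is an assumption.

```r
# Lookup from quality label to integer score (scale from the list above)
quality_scores <- c(None = 0L, Po = 1L, Fa = 2L, TA = 3L, Gd = 4L, Ex = 5L)

# Hypothetical helper: convert a quality column to integer scores
quality_to_int <- function(x) {
  # as.character() so that factors are looked up by label, not by level code
  unname(quality_scores[as.character(x)])
}

# Toy stand-in for `data` (hypothetical values)
toy <- data.frame(
  ExterQual  = factor(c("Gd", "TA", "Ex")),
  GarageQual = factor(c("TA", "None", "Fa"))
)

for (col in c("ExterQual", "GarageQual")) {
  toy[[col]] <- quality_to_int(toy[[col]])
}
str(toy)
```

Looking labels up by name keeps the order explicit; as.integer() on a factor would instead return alphabetical level codes, which do not follow the quality scale.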
Image source: https://www.homelight.com/blog/buyer-what-is-a-half-bath/
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
A half-bath, also known as a powder room or guest bath, has only two of the four main bathroom components, typically a toilet and sink. I create a new variable, Totbath (total bathrooms), equal to the sum of the bath variables, with each half bath counted as 0.5.
data$Totbath<-data$BsmtFullBath+data$BsmtHalfBath*0.5+data$FullBath+data$HalfBath*0.5
The new variable Age is the year sold minus the remodel date. To record whether a house has been remodeled, I add the variable Remod. The variable is.new shows whether a house is new (sold in the year it was built) or old.
data$Age<-data$YrSold-data$YearRemodAdd
data$Remod<-ifelse(data$YearBuilt==data$YearRemodAdd, 0,1) #0-No, 1-Remod
data$is.new<-ifelse(data$YearBuilt==data$YrSold, 1,0) #1-new, 0-old
Image source: https://www.housebeautiful.com/home-remodeling/diy-projects/a32585120/how-to-screen-in-a-porch/
There are 4 variables that contain the area in square feet of different types of porches. I combine this information into one variable, TotPorsh (total porch area).
data$TotPorsh<-data$OpenPorchSF+data$EnclosedPorch+data$`3SsnPorch`+data$ScreenPorch
I create a correlation matrix of the numeric variables whose correlation with price is above 0.5 in absolute value.
train<-as.data.frame(data[1:1460,])
test<-as.data.frame(data[1461:nrow(data),])
train_num<-train[sapply(train,is.numeric)]
corr_matrix<-cor(train_num)
cor_sorted <- as.matrix(sort(corr_matrix[,'SalePrice'], decreasing = TRUE))
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
corr_matrix <- corr_matrix[CorHigh, CorHigh]
corrplot.mixed(corr_matrix, tl.col="black", tl.pos = "lt")
As we see, there is high correlation between some variables, for example, between GarageCars and GarageArea, GarageCond and GarageQual.
train_num_no_price<-train_num %>%
select(-SalePrice)
corr_matrix_n<-cor(train_num_no_price)
top.mat <- function(X, level=0.45, N=12, values=TRUE) {
X.nam <- row.names(X)
X.tri <- as.vector(lower.tri(X))
X.rep.g <- rep(X.nam, length(X.nam))[X.tri]
X.rep.e <- rep(X.nam, each=length(X.nam))[X.tri]
X.vec <- as.vector(X)[X.tri]
X.df <- data.frame(Var1=X.rep.g, Var2=X.rep.e, Value=X.vec)
{if (values)
{X.df <- X.df[abs(X.df$Value) >= level, ]
X.df <- X.df[order(-abs(X.df$Value)), ]}
else
{X.df <- X.df[order(-abs(X.df$Value)), ]
X.df <- X.df[1:N, ]}}
row.names(X.df) <- seq(1, along=X.df$Value)
return(X.df)
}
highcor<-top.mat(corr_matrix_n)
highcor %>%
filter(Value>0.75)
## Var1 Var2 Value
## 1 GarageCond GarageQual 0.9591716
## 2 PoolQC PoolArea 0.9370565
## 3 GarageArea GarageCars 0.8824754
## 4 FireplaceQu Fireplaces 0.8632412
## 5 GarageYrBlt YearBuilt 0.8453790
## 6 TotRmsAbvGrd GrLivArea 0.8254894
## 7 1stFlrSF TotalBsmtSF 0.8195300
highCor = findCorrelation(corr_matrix_n, cutoff = 0.75)
print("Names of these variables:")
## [1] "Names of these variables:"
names(train_num)[highCor]
## [1] "GrLivArea" "GarageCars" "YearBuilt" "TotalBsmtSF" "Totbath"
## [6] "FireplaceQu" "GarageQual" "PoolArea"
To decide which variables to remove because of high correlation, I look at the correlation between SalePrice and each variable of a pair. I will remove the variable with the smaller coefficient.
cor(train$SalePrice, train$GarageCond)
## [1] 0.2631908
cor(train$SalePrice, train$GarageQual)
## [1] 0.2738391
cor(train$SalePrice, train$GarageArea)
## [1] 0.6234314
cor(train$SalePrice, train$GarageCars)
## [1] 0.6404092
cor(train$SalePrice, train$TotRmsAbvGrd)
## [1] 0.5337232
cor(train$SalePrice, train$GrLivArea)
## [1] 0.7086245
cor(train$SalePrice, train$`1stFlrSF`)
## [1] 0.6058522
cor(train$SalePrice, train$TotalBsmtSF)
## [1] 0.6135806
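The comparison can also be done programmatically. This is a minimal sketch that reuses the coefficients printed above; `price_cor`, `pairs` and `to_drop` are hypothetical names introduced for illustration.

```r
# Correlations with SalePrice, copied from the output above
price_cor <- c(
  GarageCond = 0.2631908, GarageQual = 0.2738391,
  GarageArea = 0.6234314, GarageCars = 0.6404092,
  TotRmsAbvGrd = 0.5337232, GrLivArea = 0.7086245,
  `1stFlrSF` = 0.6058522, TotalBsmtSF = 0.6135806
)

# Highly correlated pairs being compared
pairs <- data.frame(
  Var1 = c("GarageCond", "GarageArea", "TotRmsAbvGrd", "1stFlrSF"),
  Var2 = c("GarageQual", "GarageCars", "GrLivArea", "TotalBsmtSF"),
  stringsAsFactors = FALSE
)

# From each pair, drop the variable with the smaller |correlation| with price
drop_var1 <- abs(price_cor[pairs$Var1]) < abs(price_cor[pairs$Var2])
to_drop <- unname(ifelse(drop_var1, pairs$Var1, pairs$Var2))
print(to_drop)
```

With these coefficients the first variable of every pair is the weaker one, which matches the variables removed later.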
I also search for linearly dependent variables.
linCombo <- findLinearCombos(corr_matrix_n)
names(train_num_no_price)[linCombo$remove]
## [1] "TotalBsmtSF" "GrLivArea" "Totbath" "Age" "TotPorsh"
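Engineered totals such as Totbath show up here because they are exact linear combinations of other columns. This is a minimal sketch on made-up values; it uses base R's qr() on the data matrix rather than findLinearCombos, only to illustrate the idea, and `toy`, `rank_with` and `rank_without` are hypothetical names.

```r
# Toy stand-in for the numeric training data (hypothetical values)
toy <- data.frame(
  FullBath  = c(2, 1, 2, 1, 3),
  HalfBath  = c(1, 0, 1, 1, 0),
  GrLivArea = c(1710, 1262, 1786, 1717, 2198)
)

# Engineered feature: an exact linear combination of two existing columns
toy$Totbath <- toy$FullBath + 0.5 * toy$HalfBath

X <- as.matrix(toy)

# Rank of the full matrix and of the matrix without the engineered column
rank_with    <- qr(X)$rank
rank_without <- qr(X[, colnames(X) != "Totbath"])$rank

print(c(columns = ncol(X), rank_with = rank_with, rank_without = rank_without))
```

The rank does not grow when Totbath is added, so the column carries no new linear information. That is why either the total or its components should be removed before fitting a linear model.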
I remove the variables that are components of TotalBsmtSF, GrLivArea, Totbath, Age and TotPorsh (except YrBlt), together with the weaker variable of each highly correlated pair. Then I reorder the variables in the data frame.
remove_name<-which(colnames(data) %in% c("YearRemodAdd", 'BsmtFinSF1','BsmtFinSF2','BsmtUnfSF',
'BsmtFullBath','BsmtHalfBath', 'HalfBath', 'FullBath',
'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'GarageCond', 'GarageArea',
'TotRmsAbvGrd','1stFlrSF','2ndFlrSF'))
#new data (without the removed variables)
data<-data[,-remove_name]
#reordered data
data<-data %>%
select(MSSubClass:LowQualFinSF,Totbath:is.new, GrLivArea:GarageQual, TotPorsh, PavedDrive:SalePrice)
train<-data[1:1460,]
test<-data[1461:nrow(data),]
train_num<-train[sapply(train,is.numeric)]
corr_matrix<-cor(train_num)
cor_sorted <- as.matrix(sort(corr_matrix[,'SalePrice'], decreasing = TRUE))
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
corr_matrix <- corr_matrix[CorHigh, CorHigh]
corrplot.mixed(corr_matrix, tl.col="black", tl.pos = "lt")
## [1] 35
In the dataset, there are 35 factor variables. I split all factor variables into 11 groups. Each group describes some parameters of a property. I created a function to show the relationship between factor variables and price.
factors <- sapply(train, function(x) is.factor(x))
factors_only<- train[,factors]
bar_price <- function(data, var,color=mycol){
ggplot(data,aes(fct_reorder(!!sym(var),SalePrice, .desc = TRUE),
SalePrice, fill = !!sym(var))) +
stat_summary(aes(y = SalePrice), fun = "median", geom = "bar")+
geom_hline(yintercept = median(data$SalePrice), color="red")+
scale_fill_manual(values = rep(color, 25))+
geom_label(stat = "count", aes(label = ..count.., y = ..count..),
fill="white")+
ylab("SalePrice")+
xlab(var)+
theme_bw()+
theme(legend.position = "none")
}
temp1 <- lapply(names(factors_only), bar_price, data = train, "#202040")
temp2<-lapply(names(factors_only), bar_price, data = train, "#4ea0ae")
temp3<-lapply(names(factors_only), bar_price, data = train, "#158467")
temp4<-lapply(names(factors_only), bar_price, data = train, "#ffd571")
Each bar plot shows the median price in each category of a factor variable. The red line is the median price across all houses. The label at the bottom of each bar is the number of houses in the category.
The first group is type and style of dwelling.
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
MSZoning: Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
class_zone<-grid.arrange(temp1[[1]],temp1[[2]])
my_summarise <- function(data=train, group_var) {
data %>%
group_by({{ group_var }}) %>%
dplyr::summarise(mean_price = mean(SalePrice), n=n()) %>%
arrange(desc(n))
}
MSSubClass<-train %>%
my_summarise(MSSubClass)
MSZoning<-train %>%
my_summarise(MSZoning)
Insights:
The 3 most popular types of dwelling are:
20 1-STORY 1946 & NEWER ALL STYLES
60 2-STORY 1946 & NEWER
50 1-1/2 STORY FINISHED ALL AGES
The 3 most expensive categories are:
60 2-STORY 1946 & NEWER
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
80 SPLIT OR MULTI-LEVEL
The most popular zoning is RL Residential Low Density, with a mean price of $191005.
In this group I include all the information about the lot and the area:
Street: Type of road access to property
Grvl Gravel
Pave Paved
Alley: Type of alley access to property
Grvl Gravel
Pave Paved
None No alley access
LotShape: General shape of property
Reg Regular
IR1 Slightly irregular
IR2 Moderately Irregular
IR3 Irregular
LandContour: Flatness of the property
Lvl Near Flat/Level
Bnk Banked - Quick and significant rise from street grade to building
HLS Hillside - Significant slope from side to side
Low Depression
LotConfig: Lot configuration
Inside Inside lot
Corner Corner lot
CulDSac Cul-de-sac
FR2 Frontage on 2 sides of property
FR3 Frontage on 3 sides of property
LandSlope: Slope of property
Gtl Gentle slope
Mod Moderate Slope
Sev Severe Slope
land<-grid.arrange(temp2[[3]],temp2[[4]],temp2[[5]],temp2[[6]], temp2[[8]], temp2[[9]])
Insights:
With the exception of 6 houses, road access to the property is paved.
Most houses have no alley access.
The lot shape is mainly regular, but 30% of all properties have a slightly irregular shape.
The flatness of the property is mostly Lvl Near Flat/Level. Its median price is equal to the median price among all houses.
72% of houses have an inside lot configuration.
95% of houses have a gentle slope.
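Percentages like these come from the category counts. This is a minimal sketch on a made-up vector; `lot_config` is a hypothetical stand-in for a column such as train$LotConfig, so the numbers below are not the real shares.

```r
# Toy stand-in for train$LotConfig (hypothetical values)
lot_config <- c("Inside", "Inside", "Corner", "Inside", "CulDSac",
                "Inside", "Corner", "Inside", "FR2", "Inside")

# Counts per category, converted to percentages and sorted, largest first
shares <- sort(round(100 * prop.table(table(lot_config))), decreasing = TRUE)
print(shares)
```

prop.table() divides each count by the total, so the shares always sum to 100% before rounding.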
This group includes type and style of dwelling:
BldgType: Type of dwelling
1Fam Single-family Detached
2FmCon Two-family Conversion; originally built as one-family dwelling
Duplx Duplex
TwnhsE Townhouse End Unit
TwnhsI Townhouse Inside Unit
HouseStyle: Style of dwelling
1Story One story
1.5Fin One and one-half story: 2nd level finished
1.5Unf One and one-half story: 2nd level unfinished
2Story Two story
2.5Fin Two and one-half story: 2nd level finished
2.5Unf Two and one-half story: 2nd level unfinished
SFoyer Split Foyer
SLvl Split Level
type_style<-grid.arrange(temp4[[13]],temp4[[14]])
Insights:
The most popular building type is 1Fam Single-family Detached. Its median price is a little higher than the median price among all houses.
One-story houses are the most popular. Two-story houses are second in popularity; unlike one-story houses, their price is higher than the median price among all houses.
This group brings together the roof type, roof material and masonry veneer type.
RoofStyle: Type of roof
Flat Flat
Gable Gable
Gambrel Gambrel (Barn)
Hip Hip
Mansard Mansard
Shed Shed
RoofMatl: Roof material
ClyTile Clay or Tile
CompShg Standard (Composite) Shingle
Membran Membrane
Metal Metal
Roll Roll
Tar&Grv Gravel & Tar
WdShake Wood Shakes
WdShngl Wood Shingles
MasVnrType: Masonry veneer type
BrkCmn Brick Common
BrkFace Brick Face
CBlock Cinder Block
None None
Stone Stone
roof_exterior<-grid.arrange(temp1[[15]],temp1[[16]], temp1[[19]])
Insights:
78% of houses have a gable roof. Their price is nearly equal to the median price among all houses.
20% of houses have a hip roof. Their price is a little higher than the median price among all houses.
Almost all houses have CompShg standard (Composite) shingle as roof material.
60% of houses do not have masonry veneer.
Houses with stone or brick face masonry veneer are more expensive.
This group describes the basement and foundation parameters:
Foundation: Type of foundation
BrkTil Brick & Tile
CBlock Cinder Block
PConc Poured Concrete
Slab Slab
Stone Stone
Wood Wood
BsmtExposure: Refers to walkout or garden level walls
Gd Good Exposure
Av Average Exposure (split levels or foyers typically score average or above)
Mn Minimum Exposure
No No Exposure
NA No Basement
BsmtFinType1 and BsmtFinType2(if multiple types): Rating of basement finished area
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
bsmt<-grid.arrange(temp2[[20]],temp2[[21]],temp2[[22]],temp2[[23]])
Insights:
44% of houses have a PConc poured concrete foundation and 43% of houses have a CBlock cinder block foundation. Houses with a poured concrete foundation are more expensive than those with a cinder block foundation.
The higher the rating of basement finished area, the higher the price.
This group shows what kind of utilities houses have and how the price changes with them.
Utilities: Type of utilities available
AllPub All public Utilities (E,G,W,& S)
NoSewr Electricity, Gas, and Water (Septic Tank)
NoSeWa Electricity and Gas Only
ELO Electricity only
Heating: Type of heating
Floor Floor Furnace
GasA Gas forced warm air furnace
GasW Gas hot water or steam heat
Grav Gravity furnace
OthW Hot water or steam heat other than gas
Wall Wall furnace
CentralAir: Central air conditioning
N No
Y Yes
Electrical: Electrical system
SBrkr Standard Circuit Breakers & Romex
FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
Mix Mixed
Functional: Home functionality (Assume typical unless deductions are warranted)
Typ Typical Functionality
Min1 Minor Deductions 1
Min2 Minor Deductions 2
Mod Moderate Deductions
Maj1 Major Deductions 1
Maj2 Major Deductions 2
Sev Severely Damaged
Sal Salvage only
util<-grid.arrange(temp3[[7]],temp3[[24]],temp3[[25]], temp3[[26]], temp3[[27]])
Insights:
Except for one house, all houses have AllPub all public utilities (E,G,W,& S).
Most houses have GasA gas forced warm air furnace for heating.
Most houses have central air conditioning.
Most homes are equipped with SBrkr standard circuit breakers & Romex.
Functionality is mainly typical, i.e. no deductions are warranted.
This group describes the garage:
GarageType: Garage location
2Types More than one type of garage
Attchd Attached to home
Basment Basement Garage
BuiltIn Built-In (Garage part of house - typically has room above garage)
CarPort Car Port
Detchd Detached from home
NA No Garage
GarageFinish: Interior finish of the garage
Fin Finished
RFn Rough Finished
Unf Unfinished
NA No Garage
PavedDrive: Paved driveway
Y Paved
P Partial Pavement
N Dirt/Gravel
garage<-grid.arrange(temp4[[28]],temp4[[29]], temp4[[30]])
Insights:
60% of houses have a garage attached to the home; 27% have a garage detached from the home.
41% of houses have an unfinished garage; their price is lower than the median price among all houses.
Most houses have a paved driveway.