House Price Predictors
Kaggle.com, House Prices: Advanced Regression Techniques competition. This playground competition’s dataset shows that much more influences price negotiations than the number of bedrooms or a white-picket fence.
The data set contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, and the challenge is to predict the final price of each home.
Load Required Libraries
library(ggplot2)
library(plyr)
library(tidyverse)
library(reactable)
library(scales)
library(summarytools)
library(plotly)
library(psych)
library(car)
library(corrr)
library(corrplot)
library(correlation)
library(Matrix)
library(moments)
library(MASS)
library(ggrepel)
library(caret)
library(Hmisc)
library(matlib)
library(graphics)
library(ggpubr)
library(leaps)
The data sets, train.csv and test.csv, were transferred to a personal GitHub repository, loaded with the read_csv function, and converted to data frames.
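As a minimal sketch of this loading step, the snippet below writes a tiny stand-in CSV to a temporary file and reads it back. The file contents and the names `tmp_csv` and `toy_train` are made up for illustration, and base `read.csv()` is used in place of `read_csv()` so that the sketch runs without extra packages. It also suggests why a column such as `1stFlrSF` shows up as `X1stFlrSF` once the data sits in a data frame built with name checking enabled.

```r
# Hypothetical stand-in for train.csv: three rows of made-up values
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("Id,1stFlrSF,Heating,SalePrice",
             "1,900,GasA,100000",
             "2,1200,GasA,150000",
             "3,1000,GasW,125000"), tmp_csv)

# check.names = TRUE (the default) rewrites syntactically invalid names,
# so "1stFlrSF" becomes "X1stFlrSF"; character columns stay character
toy_train <- read.csv(tmp_csv, stringsAsFactors = FALSE)

names(toy_train)
cat("Train has", nrow(toy_train), "rows and", ncol(toy_train), "columns.\n")
```

For the real files, the temporary path would be replaced by the location of train.csv and test.csv.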
1. UNIVARIATE DESCRIPTIVE Statistics and appropriate plots for the training data set.
The univariate descriptive analysis examines the distribution, central tendency, and variability of the house prices training data set, based on the following functions:
1. summary(): base (generic) function; provides the minimum, 1st quartile, median, mean, 3rd quartile, and maximum values.
2. descr(): summarytools package; calculates mean, sd, min, Q1, median, Q3, max, MAD, IQR, CV, skewness, SE.skewness, and kurtosis on numerical vectors. Some of these statistics are not available when sampling weights are used.
3. freq(): summarytools package; displays weighted or unweighted frequencies, including NA counts and proportions.
4. describeBy(): psych package; reports basic summary statistics by a grouping variable.
Understanding the data
## Train has 1460 rows and 81 columns.
## Test has 1459 rows and 80 columns.
Preview of the dataset and its structure:
library(reactable)
reactable(head(df_train), striped = TRUE, bordered = TRUE, wrap = FALSE) #first 6 observations
The dataset contains 1460 observations and 81 variables describing each dwelling involved in a house sale. The variables are of two types: numeric and character. The character variables take a limited number of unique strings, so each can be converted to a factor variable.
Retrieve the column names of the data set:
colnames(df_train)
## [1] "Id" "MSSubClass" "MSZoning" "LotFrontage"
## [5] "LotArea" "Street" "Alley" "LotShape"
## [9] "LandContour" "Utilities" "LotConfig" "LandSlope"
## [13] "Neighborhood" "Condition1" "Condition2" "BldgType"
## [17] "HouseStyle" "OverallQual" "OverallCond" "YearBuilt"
## [21] "YearRemodAdd" "RoofStyle" "RoofMatl" "Exterior1st"
## [25] "Exterior2nd" "MasVnrType" "MasVnrArea" "ExterQual"
## [29] "ExterCond" "Foundation" "BsmtQual" "BsmtCond"
## [33] "BsmtExposure" "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2"
## [37] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "Heating"
## [41] "HeatingQC" "CentralAir" "Electrical" "X1stFlrSF"
## [45] "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [49] "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr"
## [53] "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
## [57] "Fireplaces" "FireplaceQu" "GarageType" "GarageYrBlt"
## [61] "GarageFinish" "GarageCars" "GarageArea" "GarageQual"
## [65] "GarageCond" "PavedDrive" "WoodDeckSF" "OpenPorchSF"
## [69] "EnclosedPorch" "X3SsnPorch" "ScreenPorch" "PoolArea"
## [73] "PoolQC" "Fence" "MiscFeature" "MiscVal"
## [77] "MoSold" "YrSold" "SaleType" "SaleCondition"
## [81] "SalePrice"
Check data frame for NULL and NA values:
is.null(df_train)
## [1] FALSE
any(is.na(df_train))
## [1] TRUE
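The two checks above only give a yes/no answer: `is.null()` tests whether the object itself is NULL, and `any(is.na())` reports whether at least one value is missing anywhere. A per-column count is usually more informative. The following is a small sketch on a made-up data frame (`toy` and its values are assumptions, not rows from the Kaggle data):

```r
# Toy data frame standing in for df_train (values are made up)
toy <- data.frame(
  LotFrontage = c(65, NA, 68, NA),
  MasVnrArea  = c(196, 0, NA, 0),
  Heating     = c("GasA", "GasA", "GasW", "GasA"),
  stringsAsFactors = FALSE
)

# Count missing values per column, then show only columns that have any
na_counts <- colSums(is.na(toy))
na_counts[na_counts > 0]

# Share of missing values per column, in percent
na_pct <- round(100 * colMeans(is.na(toy)), 1)
na_pct
```

Applied to the training data, the same two lines would list the columns that carry the NA counts reported in the summaries further down.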
The following is a high-level summary of the training dataset from Kaggle, produced without converting character variables to factors and without removing NA or NULL values from the data frame.
The SalePrice variable is continuous, and a log transform will bring its distribution as close to “normal” as possible for the statistical analysis.
# convert the monetary variable to log scale; this reduces the impact of outliers and decreases the skewness in the data set
log_train <- df_train %>% mutate(log_SalePrice = log(SalePrice))
Histograms of SalePrice (raw and log-transformed): the raw distribution is right-skewed, with the median sale price (163,000) below the mean (180,921).
par(mfrow= c(2,1))
hist(log_train$SalePrice, col = "darkmagenta", main = "Histogram of Sale Price")
hist(log_train$log_SalePrice, col = "goldenrod", main = "Histogram of Log Transform Sale Price")
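To put a number on what the two histograms show, skewness can be computed before and after the transform. The sketch below uses simulated log-normal prices rather than the Ames data (`toy_price`, the seed, and the distribution parameters are all assumptions), and a hand-written moment-based `skew()` helper rather than a package function.

```r
set.seed(42)

# Simulated right-skewed "prices" (log-normal), NOT the Ames data
toy_price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Moment-based sample skewness: third central moment over sd cubed
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skew(toy_price)       # clearly positive: long right tail
skew_log <- skew(log(toy_price))  # near zero: roughly symmetric

c(raw = skew_raw, log = skew_log)
```

On the real data, the same comparison would be `skew(log_train$SalePrice)` against `skew(log_train$log_SalePrice)`.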
The summary of SalePrice before and after the log transform shows how rescaling compresses the range and pulls the mean and median together, which is consistent with a more nearly normal distribution.
#actual observations of Sale Price
summary(log_train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
#log transformation of SalePrice
summary(log_train$log_SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.46 11.78 12.00 12.02 12.27 13.53
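The two summaries can be tied together with a little arithmetic, since `exp()` undoes `log()`. The sketch below only reuses the rounded numbers printed above, so the back-transformed values are approximate.

```r
# Minimum and maximum sale prices from the first summary, on the log scale
log_range <- round(log(c(34900, 755000)), 2)
log_range

# Rounded log-scale median and mean from the second summary
log_median <- 12.00
log_mean   <- 12.02

# Back on the dollar scale: exp(median of logs) sits near the raw median,
# while exp(mean of logs) is the geometric mean, below the arithmetic mean
back_median <- exp(log_median)
back_mean   <- exp(log_mean)
c(back_median = back_median, back_mean = back_mean)
```

The gap between `back_mean` and the arithmetic mean of 180,921 is another sign of the right skew in the raw prices.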
Numeric summary of the data for the independent variables and the dependent variable:
summary(log_train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice log_SalePrice
## Min. : 34900 Min. :10.46
## 1st Qu.:129975 1st Qu.:11.78
## Median :163000 Median :12.00
## Mean :180921 Mean :12.02
## 3rd Qu.:214000 3rd Qu.:12.27
## Max. :755000 Max. :13.53
##
The descr() function computes descriptive (univariate) statistics for numerical vectors.
#summarytools package
descr(select_if(log_train, is.numeric), style = "rmarkdown")
## ### Descriptive Statistics
##
## | | BedroomAbvGr | BsmtFinSF1 | BsmtFinSF2 | BsmtFullBath | BsmtHalfBath |
## |----------------:|-------------:|-----------:|-----------:|-------------:|-------------:|
## | **Mean** | 2.87 | 443.64 | 46.55 | 0.43 | 0.06 |
## | **Std.Dev** | 0.82 | 456.10 | 161.32 | 0.52 | 0.24 |
## | **Min** | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
## | **Q1** | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 |
## | **Median** | 3.00 | 383.50 | 0.00 | 0.00 | 0.00 |
## | **Q3** | 3.00 | 712.50 | 0.00 | 1.00 | 0.00 |
## | **Max** | 8.00 | 5644.00 | 1474.00 | 3.00 | 2.00 |
## | **MAD** | 0.00 | 568.58 | 0.00 | 0.00 | 0.00 |
## | **IQR** | 1.00 | 712.25 | 0.00 | 1.00 | 0.00 |
## | **CV** | 0.28 | 1.03 | 3.47 | 1.22 | 4.15 |
## | **Skewness** | 0.21 | 1.68 | 4.25 | 0.59 | 4.09 |
## | **SE.Skewness** | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 |
## | **Kurtosis** | 2.21 | 11.06 | 20.01 | -0.84 | 16.31 |
## | **N.Valid** | 1460.00 | 1460.00 | 1460.00 | 1460.00 | 1460.00 |
## | **Pct.Valid** | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
##
## Table: Table continues below
##
##
##
## | | BsmtUnfSF | EnclosedPorch | Fireplaces | FullBath | GarageArea |
## |----------------:|----------:|--------------:|-----------:|---------:|-----------:|
## | **Mean** | 567.24 | 21.95 | 0.61 | 1.57 | 472.98 |
## | **Std.Dev** | 441.87 | 61.12 | 0.64 | 0.55 | 213.80 |
## | **Min** | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
## | **Q1** | 223.00 | 0.00 | 0.00 | 1.00 | 333.00 |
## | **Median** | 477.50 | 0.00 | 1.00 | 2.00 | 480.00 |
## | **Q3** | 808.00 | 0.00 | 1.00 | 2.00 | 576.00 |
## | **Max** | 2336.00 | 552.00 | 3.00 | 3.00 | 1418.00 |
## | **MAD** | 426.99 | 0.00 | 1.48 | 0.00 | 177.91 |
## | **IQR** | 585.00 | 0.00 | 1.00 | 1.00 | 241.50 |
## | **CV** | 0.78 | 2.78 | 1.05 | 0.35 | 0.45 |
## | **Skewness** | 0.92 | 3.08 | 0.65 | 0.04 | 0.18 |
## | **SE.Skewness** | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 |
## | **Kurtosis** | 0.46 | 10.37 | -0.22 | -0.86 | 0.90 |
## | **N.Valid** | 1460.00 | 1460.00 | 1460.00 | 1460.00 | 1460.00 |
## | **Pct.Valid** | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
##
## Table: Table continues below
##
##
##
## | | GarageCars | GarageYrBlt | GrLivArea | HalfBath | Id | KitchenAbvGr |
## |----------------:|-----------:|------------:|----------:|---------:|--------:|-------------:|
## | **Mean** | 1.77 | 1978.51 | 1515.46 | 0.38 | 730.50 | 1.05 |
## | **Std.Dev** | 0.75 | 24.69 | 525.48 | 0.50 | 421.61 | 0.22 |
## | **Min** | 0.00 | 1900.00 | 334.00 | 0.00 | 1.00 | 0.00 |
## | **Q1** | 1.00 | 1961.00 | 1129.00 | 0.00 | 365.50 | 1.00 |
## | **Median** | 2.00 | 1980.00 | 1464.00 | 0.00 | 730.50 | 1.00 |
## | **Q3** | 2.00 | 2002.00 | 1777.50 | 1.00 | 1095.50 | 1.00 |
## | **Max** | 4.00 | 2010.00 | 5642.00 | 2.00 | 1460.00 | 3.00 |
## | **MAD** | 0.00 | 31.13 | 483.33 | 0.00 | 541.15 | 0.00 |
## | **IQR** | 1.00 | 41.00 | 647.25 | 1.00 | 729.50 | 0.00 |
## | **CV** | 0.42 | 0.01 | 0.35 | 1.31 | 0.58 | 0.21 |
## | **Skewness** | -0.34 | -0.65 | 1.36 | 0.67 | 0.00 | 4.48 |
## | **SE.Skewness** | 0.06 | 0.07 | 0.06 | 0.06 | 0.06 | 0.06 |
## | **Kurtosis** | 0.21 | -0.42 | 4.86 | -1.08 | -1.20 | 21.42 |
## | **N.Valid** | 1460.00 | 1379.00 | 1460.00 | 1460.00 | 1460.00 | 1460.00 |
## | **Pct.Valid** | 100.00 | 94.45 | 100.00 | 100.00 | 100.00 | 100.00 |
##
## Table: Table continues below
##
##
##
## | | log_SalePrice | LotArea | LotFrontage | LowQualFinSF | MasVnrArea |
## |----------------:|--------------:|----------:|------------:|-------------:|-----------:|
## | **Mean** | 12.02 | 10516.83 | 70.05 | 5.84 | 103.69 |
## | **Std.Dev** | 0.40 | 9981.26 | 24.28 | 48.62 | 181.07 |
## | **Min** | 10.46 | 1300.00 | 21.00 | 0.00 | 0.00 |
## | **Q1** | 11.77 | 7549.00 | 59.00 | 0.00 | 0.00 |
## | **Median** | 12.00 | 9478.50 | 69.00 | 0.00 | 0.00 |
## | **Q3** | 12.27 | 11603.00 | 80.00 | 0.00 | 166.00 |
## | **Max** | 13.53 | 215245.00 | 313.00 | 572.00 | 1600.00 |
## | **MAD** | 0.36 | 2962.23 | 16.31 | 0.00 | 0.00 |
## | **IQR** | 0.50 | 4048.00 | 21.00 | 0.00 | 166.00 |
## | **CV** | 0.03 | 0.95 | 0.35 | 8.32 | 1.75 |
## | **Skewness** | 0.12 | 12.18 | 2.16 | 8.99 | 2.66 |
## | **SE.Skewness** | 0.06 | 0.06 | 0.07 | 0.06 | 0.06 |
## | **Kurtosis** | 0.80 | 202.26 | 17.34 | 82.83 | 10.03 |
## | **N.Valid** | 1460.00 | 1460.00 | 1201.00 | 1460.00 | 1452.00 |
## | **Pct.Valid** | 100.00 | 100.00 | 82.26 | 100.00 | 99.45 |
##
## Table: Table continues below
##
##
##
## | | MiscVal | MoSold | MSSubClass | OpenPorchSF | OverallCond | OverallQual |
## |----------------:|---------:|--------:|-----------:|------------:|------------:|------------:|
## | **Mean** | 43.49 | 6.32 | 56.90 | 46.66 | 5.58 | 6.10 |
## | **Std.Dev** | 496.12 | 2.70 | 42.30 | 66.26 | 1.11 | 1.38 |
## | **Min** | 0.00 | 1.00 | 20.00 | 0.00 | 1.00 | 1.00 |
## | **Q1** | 0.00 | 5.00 | 20.00 | 0.00 | 5.00 | 5.00 |
## | **Median** | 0.00 | 6.00 | 50.00 | 25.00 | 5.00 | 6.00 |
## | **Q3** | 0.00 | 8.00 | 70.00 | 68.00 | 6.00 | 7.00 |
## | **Max** | 15500.00 | 12.00 | 190.00 | 547.00 | 9.00 | 10.00 |
## | **MAD** | 0.00 | 2.97 | 44.48 | 37.06 | 0.00 | 1.48 |
## | **IQR** | 0.00 | 3.00 | 50.00 | 68.00 | 1.00 | 2.00 |
## | **CV** | 11.41 | 0.43 | 0.74 | 1.42 | 0.20 | 0.23 |
## | **Skewness** | 24.43 | 0.21 | 1.40 | 2.36 | 0.69 | 0.22 |
## | **SE.Skewness** | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 |
## | **Kurtosis** | 697.64 | -0.41 | 1.56 | 8.44 | 1.09 | 0.09 |
## | **N.Valid** | 1460.00 | 1460.00 | 1460.00 | 1460.00 | 1460.00 | 1460.00 |
## | **Pct.Valid** | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
##
## Table: Table continues below
##
##
##
## | | PoolArea | SalePrice | ScreenPorch | TotalBsmtSF | TotRmsAbvGrd |
## |----------------:|---------:|----------:|------------:|------------:|-------------:|
## | **Mean** | 2.76 | 180921.20 | 15.06 | 1057.43 | 6.52 |
## | **Std.Dev** | 40.18 | 79442.50 | 55.76 | 438.71 | 1.63 |
## | **Min** | 0.00 | 34900.00 | 0.00 | 0.00 | 2.00 |
## | **Q1** | 0.00 | 129950.00 | 0.00 | 795.50 | 5.00 |
## | **Median** | 0.00 | 163000.00 | 0.00 | 991.50 | 6.00 |
## | **Q3** | 0.00 | 214000.00 | 0.00 | 1298.50 | 7.00 |
## | **Max** | 738.00 | 755000.00 | 480.00 | 6110.00 | 14.00 |
## | **MAD** | 0.00 | 56338.80 | 0.00 | 347.67 | 1.48 |
## | **IQR** | 0.00 | 84025.00 | 0.00 | 502.50 | 2.00 |
## | **CV** | 14.56 | 0.44 | 3.70 | 0.41 | 0.25 |
## | **Skewness** | 14.80 | 1.88 | 4.11 | 1.52 | 0.67 |
## | **SE.Skewness** | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 |
## | **Kurtosis** | 222.19 | 6.50 | 18.34 | 13.18 | 0.87 |
## | **N.Valid** | 1460.00 | 1460.00 | 1460.00 | 1460.00 | 1460.00 |
## | **Pct.Valid** | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
##
## Table: Table continues below
##
##
##
## | | WoodDeckSF | X1stFlrSF | X2ndFlrSF | X3SsnPorch | YearBuilt |
## |----------------:|-----------:|----------:|----------:|-----------:|----------:|
## | **Mean** | 94.24 | 1162.63 | 346.99 | 3.41 | 1971.27 |
## | **Std.Dev** | 125.34 | 386.59 | 436.53 | 29.32 | 30.20 |
## | **Min** | 0.00 | 334.00 | 0.00 | 0.00 | 1872.00 |
## | **Q1** | 0.00 | 882.00 | 0.00 | 0.00 | 1954.00 |
## | **Median** | 0.00 | 1087.00 | 0.00 | 0.00 | 1973.00 |
## | **Q3** | 168.00 | 1391.50 | 728.00 | 0.00 | 2000.00 |
## | **Max** | 857.00 | 4692.00 | 2065.00 | 508.00 | 2010.00 |
## | **MAD** | 0.00 | 347.67 | 0.00 | 0.00 | 37.06 |
## | **IQR** | 168.00 | 509.25 | 728.00 | 0.00 | 46.00 |
## | **CV** | 1.33 | 0.33 | 1.26 | 8.60 | 0.02 |
## | **Skewness** | 1.54 | 1.37 | 0.81 | 10.28 | -0.61 |
## | **SE.Skewness** | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 |
## | **Kurtosis** | 2.97 | 5.71 | -0.56 | 123.06 | -0.45 |
## | **N.Valid** | 1460.00 | 1460.00 | 1460.00 | 1460.00 | 1460.00 |
## | **Pct.Valid** | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
##
## Table: Table continues below
##
##
##
## | | YearRemodAdd | YrSold |
## |----------------:|-------------:|--------:|
## | **Mean** | 1984.87 | 2007.82 |
## | **Std.Dev** | 20.65 | 1.33 |
## | **Min** | 1950.00 | 2006.00 |
## | **Q1** | 1967.00 | 2007.00 |
## | **Median** | 1994.00 | 2008.00 |
## | **Q3** | 2004.00 | 2009.00 |
## | **Max** | 2010.00 | 2010.00 |
## | **MAD** | 19.27 | 1.48 |
## | **IQR** | 37.00 | 2.00 |
## | **CV** | 0.01 | 0.00 |
## | **Skewness** | -0.50 | 0.10 |
## | **SE.Skewness** | 0.06 | 0.06 |
## | **Kurtosis** | -1.27 | -1.19 |
## | **N.Valid** | 1460.00 | 1460.00 |
## | **Pct.Valid** | 100.00 | 100.00 |
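Several rows of the table above can be reproduced by hand, which helps when reading them. The sketch below recomputes the CV for SalePrice from its reported mean and standard deviation, shows why MAD values of 1.48 appear for integer-valued columns (base R's `mad()` scales by 1.4826), and evaluates a standard formula for the standard error of skewness. That this is the exact formula used inside summarytools is an assumption; `se_skew` and `toy_rating` are illustrative names.

```r
# Mean and Std.Dev of SalePrice, copied from the descr() output
sp_mean <- 180921.20
sp_sd   <- 79442.50

# Coefficient of variation: standard deviation relative to the mean
cv <- sp_sd / sp_mean
round(cv, 2)

# mad() multiplies the median absolute deviation by 1.4826, so a raw
# deviation of 1 on an integer scale is reported as 1.48
toy_rating <- c(5, 6, 6, 7, 8)   # made-up integer ratings
toy_mad <- mad(toy_rating)
round(toy_mad, 2)

# Standard error of skewness as a function of the number of valid cases
se_skew <- function(n) sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_vals <- round(se_skew(c(1460, 1379, 1201)), 2)
se_vals
```

The three sample sizes are the N.Valid values for complete columns, GarageYrBlt, and LotFrontage, which is why those two columns show an SE.Skewness of 0.07 instead of 0.06.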
The freq() function creates a frequency table for the Heating variable, showing frequencies, proportions, and missing-data information.
Heating | Type of heating |
---|---|
Floor | Floor Furnace |
GasA | Gas forced warm air furnace |
GasW | Gas hot water or steam heat |
Grav | Gravity furnace |
OthW | Hot water or steam heat other than gas |
Wall | Wall furnace |
#report.nas = FALSE argument removes information about missing values
#summarytools package - freq() function
house.heat <- freq(log_train$Heating, report.nas = TRUE, style = "rmarkdown")
house.heat
## ### Frequencies
## #### log_train$Heating
## **Type:** Character
##
## | | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
## |-----------:|-----:|--------:|-------------:|--------:|-------------:|
## | **Floor** | 1 | 0.068 | 0.068 | 0.068 | 0.068 |
## | **GasA** | 1428 | 97.808 | 97.877 | 97.808 | 97.877 |
## | **GasW** | 18 | 1.233 | 99.110 | 1.233 | 99.110 |
## | **Grav** | 7 | 0.479 | 99.589 | 0.479 | 99.589 |
## | **OthW** | 2 | 0.137 | 99.726 | 0.137 | 99.726 |
## | **Wall** | 4 | 0.274 | 100.000 | 0.274 | 100.000 |
## | **\<NA\>** | 0 | | | 0.000 | 100.000 |
## | **Total** | 1460 | 100.000 | 100.000 | 100.000 | 100.000 |
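The same table can be approximated in base R, which makes the percentage columns easy to check. This sketch rebuilds a Heating vector from the counts printed above instead of reading the data, so `heating` and `heat_tab` are illustrative stand-ins.

```r
# Rebuild the Heating column from the frequencies reported by freq()
heating <- rep(c("Floor", "GasA", "GasW", "Grav", "OthW", "Wall"),
               times = c(1, 1428, 18, 7, 2, 4))

# Counts, including an NA row if any values were missing
counts <- table(heating, useNA = "ifany")

# Percentages and cumulative percentages of the total
pct     <- round(100 * prop.table(counts), 3)
cum_pct <- round(100 * cumsum(prop.table(counts)), 3)

heat_tab <- data.frame(
  Freq   = as.vector(counts),
  Pct    = as.vector(pct),
  CumPct = as.vector(cum_pct),
  row.names = names(counts)
)
heat_tab
```

With no missing values in Heating, the valid and total percentages coincide, as in the freq() output.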
The describeBy() function reports several summary statistics (number of valid cases, mean, standard deviation, median, trimmed mean, and others) by a grouping variable, here the Foundation column of the data frame log_train.
Foundation | Type of foundation |
---|---|
BrkTil | Brick & Tile |
CBlock | Cinder Block |
PConc | Poured Concrete |
Slab | Slab |
Stone | Stone |
Wood | Wood |
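For a single variable of interest, a grouped summary can also be sketched in base R. The example below uses a made-up data frame (`toy` and its prices are assumptions) and `tapply()` to get the count, mean, standard deviation, and median of SalePrice per foundation type. Note that describeBy() reports character columns as numeric codes marked with `*`, so the means shown for those rows are not meaningful on their own.

```r
# Toy data: two houses per foundation type, made-up prices
toy <- data.frame(
  Foundation = c("BrkTil", "BrkTil", "CBlock", "CBlock", "PConc", "PConc"),
  SalePrice  = c(100000, 140000, 130000, 150000, 200000, 260000),
  stringsAsFactors = FALSE
)

# One statistic per call, each grouped by Foundation
grp_n      <- tapply(toy$SalePrice, toy$Foundation, length)
grp_mean   <- tapply(toy$SalePrice, toy$Foundation, mean)
grp_sd     <- tapply(toy$SalePrice, toy$Foundation, sd)
grp_median <- tapply(toy$SalePrice, toy$Foundation, median)

by_found <- data.frame(
  n      = as.vector(grp_n),
  mean   = as.vector(grp_mean),
  sd     = as.vector(grp_sd),
  median = as.vector(grp_median),
  row.names = names(grp_mean)
)
by_found
```

Restricting the summary to numeric columns in this way keeps the per-group table short and avoids the coded character variables.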
#psych library
describeBy(
  log_train,
  log_train$Foundation # grouping variable
)
##
## Descriptive statistics by group
## group: BrkTil
## vars n mean sd median trimmed mad
## Id 1 146 735.23 439.63 680.00 736.48 548.56
## MSSubClass 2 146 58.46 37.06 50.00 51.10 29.65
## MSZoning* 3 146 3.46 0.63 4.00 3.53 0.00
## LotFrontage 4 138 60.96 15.62 60.00 58.91 11.86
## LotArea 5 146 9159.17 4773.15 8510.00 8428.81 3395.15
## Street* 6 146 1.00 0.00 1.00 1.00 0.00
## Alley* 7 35 1.26 0.44 1.00 1.21 0.00
## LotShape* 8 146 3.50 1.10 4.00 3.74 0.00
## LandContour* 9 146 3.58 1.02 4.00 3.84 0.00
## Utilities* 10 146 1.00 0.00 1.00 1.00 0.00
## LotConfig* 11 146 3.29 1.26 4.00 3.47 0.00
## LandSlope* 12 146 1.05 0.28 1.00 1.00 0.00
## Neighborhood* 13 146 5.01 2.84 5.00 4.90 2.97
## Condition1* 14 146 2.86 0.81 3.00 2.92 0.00
## Condition2* 15 146 2.98 0.30 3.00 3.00 0.00
## BldgType* 16 146 1.06 0.24 1.00 1.00 0.00
## HouseStyle* 17 146 2.97 1.94 3.00 2.85 2.97
## OverallQual 18 146 5.45 1.25 5.00 5.45 1.48
## OverallCond 19 146 6.20 1.57 6.00 6.29 1.48
## YearBuilt 20 146 1921.02 13.93 1922.00 1922.47 8.90
## YearRemodAdd 21 146 1971.62 24.31 1950.00 1970.07 0.00
## RoofStyle* 22 146 1.18 0.56 1.00 1.01 0.00
## RoofMatl* 23 146 1.01 0.08 1.00 1.00 0.00
## Exterior1st* 24 146 6.32 2.09 7.00 6.65 1.48
## Exterior2nd* 25 146 7.88 2.73 9.00 8.18 1.48
## MasVnrType* 26 146 2.97 0.20 3.00 3.00 0.00
## MasVnrArea 27 146 7.00 49.67 0.00 0.00 0.00
## ExterQual* 28 146 3.88 0.42 4.00 4.00 0.00
## ExterCond* 29 146 3.64 0.67 4.00 3.80 0.00
## Foundation* 30 146 1.00 0.00 1.00 1.00 0.00
## BsmtQual* 31 145 3.62 0.72 4.00 3.78 0.00
## BsmtCond* 32 145 3.52 1.05 4.00 3.77 0.00
## BsmtExposure* 33 145 3.86 0.42 4.00 3.97 0.00
## BsmtFinType1* 34 145 4.85 1.73 6.00 5.15 0.00
## BsmtFinSF1 35 146 165.84 257.10 0.00 113.31 0.00
## BsmtFinType2* 36 145 4.90 0.56 5.00 5.00 0.00
## BsmtFinSF2 37 146 19.73 101.34 0.00 0.00 0.00
## BsmtUnfSF 38 146 629.05 318.62 673.00 637.44 323.95
## TotalBsmtSF 39 146 814.62 232.18 793.00 809.52 181.62
## Heating* 40 146 1.19 0.57 1.00 1.03 0.00
## HeatingQC* 41 146 2.50 1.23 3.00 2.50 1.48
## CentralAir* 42 146 1.72 0.45 2.00 1.77 0.00
## Electrical* 43 146 4.13 1.59 5.00 4.40 0.00
## X1stFlrSF 44 146 975.08 241.23 941.50 950.92 215.72
## X2ndFlrSF 45 146 455.01 404.55 513.00 418.34 474.43
## LowQualFinSF 46 146 21.99 94.04 0.00 0.00 0.00
## GrLivArea 47 146 1452.08 564.36 1364.50 1386.21 510.01
## BsmtFullBath 48 146 0.20 0.40 0.00 0.13 0.00
## BsmtHalfBath 49 146 0.03 0.18 0.00 0.00 0.00
## FullBath 50 146 1.33 0.54 1.00 1.26 0.00
## HalfBath 51 146 0.21 0.41 0.00 0.14 0.00
## BedroomAbvGr 52 146 2.92 0.90 3.00 2.86 1.48
## KitchenAbvGr 53 146 1.08 0.29 1.00 1.00 0.00
## KitchenQual* 54 146 3.55 0.79 4.00 3.72 0.00
## TotRmsAbvGrd 55 146 6.55 1.68 6.00 6.43 1.48
## Functional* 56 146 5.64 1.07 6.00 5.96 0.00
## Fireplaces 57 146 0.47 0.62 0.00 0.37 0.00
## FireplaceQu* 58 58 3.40 0.90 3.00 3.33 0.00
## GarageType* 59 129 4.64 0.96 5.00 4.91 0.00
## GarageYrBlt 60 129 1947.57 29.10 1937.00 1945.18 25.20
## GarageFinish* 61 129 2.89 0.42 3.00 3.00 0.00
## GarageCars 62 146 1.31 0.74 1.00 1.32 0.00
## GarageArea 63 146 344.66 209.00 308.00 337.04 161.60
## GarageQual* 64 129 4.21 1.31 5.00 4.38 0.00
## GarageCond* 65 129 3.56 1.03 4.00 3.80 0.00
## PavedDrive* 66 146 2.41 0.87 3.00 2.51 0.00
## WoodDeckSF 67 146 50.23 108.67 0.00 22.25 0.00
## OpenPorchSF 68 146 33.48 79.72 0.00 15.13 0.00
## EnclosedPorch 69 146 72.67 90.90 0.00 58.45 0.00
## X3SsnPorch 70 146 0.99 11.92 0.00 0.00 0.00
## ScreenPorch 71 146 17.82 65.98 0.00 0.00 0.00
## PoolArea 72 146 0.00 0.00 0.00 0.00 0.00
## PoolQC* 73 0 NaN NA NA NaN NA
## Fence* 74 34 2.53 0.90 3.00 2.57 0.00
## MiscFeature* 75 6 1.00 0.00 1.00 1.00 0.00
## MiscVal 76 146 27.53 142.85 0.00 0.00 0.00
## MoSold 77 146 6.45 2.50 6.00 6.39 1.48
## YrSold 78 146 2007.73 1.23 2008.00 2007.70 1.48
## SaleType* 79 146 6.71 1.16 7.00 7.00 0.00
## SaleCondition* 80 146 4.57 1.21 5.00 4.93 0.00
## SalePrice 81 146 132291.08 54592.39 125250.00 126199.13 35211.75
## log_SalePrice 82 146 11.72 0.37 11.74 11.72 0.29
## min max range skew kurtosis se
## Id 4.00 1444.00 1440.00 0.06 -1.25 36.38
## MSSubClass 30.00 190.00 160.00 2.64 6.89 3.07
## MSZoning* 1.00 4.00 3.00 -1.06 1.46 0.05
## LotFrontage 30.00 130.00 100.00 1.61 3.62 1.33
## LotArea 3636.00 45600.00 41964.00 3.79 22.80 395.03
## Street* 1.00 1.00 0.00 NaN NaN 0.00
## Alley* 1.00 2.00 1.00 1.06 -0.89 0.07
## LotShape* 1.00 4.00 3.00 -1.76 1.17 0.09
## LandContour* 1.00 4.00 3.00 -2.05 2.30 0.08
## Utilities* 1.00 1.00 0.00 NaN NaN 0.00
## LotConfig* 1.00 4.00 3.00 -1.22 -0.49 0.10
## LandSlope* 1.00 3.00 2.00 5.50 31.29 0.02
## Neighborhood* 1.00 10.00 9.00 -0.03 -1.12 0.23
## Condition1* 1.00 6.00 5.00 -0.04 3.21 0.07
## Condition2* 1.00 5.00 4.00 -0.57 28.65 0.02
## BldgType* 1.00 2.00 1.00 3.61 11.09 0.02
## HouseStyle* 1.00 6.00 5.00 0.49 -1.25 0.16
## OverallQual 1.00 10.00 9.00 0.13 2.23 0.10
## OverallCond 1.00 9.00 8.00 -0.55 0.38 0.13
## YearBuilt 1872.00 1954.00 82.00 -1.10 1.82 1.15
## YearRemodAdd 1950.00 2008.00 58.00 0.33 -1.78 2.01
## RoofStyle* 1.00 4.00 3.00 3.09 8.54 0.05
## RoofMatl* 1.00 2.00 1.00 11.84 139.04 0.01
## Exterior1st* 1.00 9.00 8.00 -1.02 0.28 0.17
## Exterior2nd* 1.00 11.00 10.00 -0.86 -0.37 0.23
## MasVnrType* 1.00 3.00 2.00 -7.96 67.27 0.02
## MasVnrArea 0.00 435.00 435.00 7.17 51.82 4.11
## ExterQual* 1.00 4.00 3.00 -4.21 20.10 0.03
## ExterCond* 1.00 4.00 3.00 -1.88 2.94 0.06
## Foundation* 1.00 1.00 0.00 NaN NaN 0.00
## BsmtQual* 1.00 4.00 3.00 -1.66 1.41 0.06
## BsmtCond* 1.00 4.00 3.00 -1.83 1.48 0.09
## BsmtExposure* 1.00 4.00 3.00 -3.68 16.63 0.03
## BsmtFinType1* 1.00 6.00 5.00 -1.23 -0.04 0.14
## BsmtFinSF1 0.00 1128.00 1128.00 1.65 2.28 21.28
## BsmtFinType2* 1.00 5.00 4.00 -5.86 34.78 0.05
## BsmtFinSF2 0.00 692.00 692.00 5.22 26.75 8.39
## BsmtUnfSF 0.00 1470.00 1470.00 -0.20 -0.48 26.37
## TotalBsmtSF 0.00 1559.00 1559.00 0.16 1.64 19.21
## Heating* 1.00 4.00 3.00 3.16 9.78 0.05
## HeatingQC* 1.00 4.00 3.00 -0.08 -1.60 0.10
## CentralAir* 1.00 2.00 1.00 -0.97 -1.08 0.04
## Electrical* 1.00 5.00 4.00 -1.32 -0.18 0.13
## X1stFlrSF 520.00 1687.00 1167.00 0.91 0.67 19.96
## X2ndFlrSF 0.00 1818.00 1818.00 0.52 -0.07 33.48
## LowQualFinSF 0.00 572.00 572.00 4.41 18.75 7.78
## GrLivArea 520.00 3608.00 3088.00 1.19 1.77 46.71
## BsmtFullBath 0.00 1.00 1.00 1.50 0.24 0.03
## BsmtHalfBath 0.00 1.00 1.00 5.07 23.86 0.02
## FullBath 0.00 3.00 3.00 1.10 0.54 0.04
## HalfBath 0.00 1.00 1.00 1.39 -0.06 0.03
## BedroomAbvGr 1.00 5.00 4.00 0.39 -0.41 0.07
## KitchenAbvGr 1.00 3.00 2.00 4.00 16.74 0.02
## KitchenQual* 1.00 4.00 3.00 -1.71 2.07 0.07
## TotRmsAbvGrd 4.00 12.00 8.00 0.73 0.48 0.14
## Functional* 1.00 6.00 5.00 -3.04 8.39 0.09
## Fireplaces 0.00 2.00 2.00 0.98 -0.12 0.05
## FireplaceQu* 1.00 5.00 4.00 0.74 0.12 0.12
## GarageType* 1.00 5.00 4.00 -2.42 4.15 0.08
## GarageYrBlt 1900.00 2007.00 107.00 0.69 -0.90 2.56
## GarageFinish* 1.00 3.00 2.00 -3.85 13.72 0.04
## GarageCars 0.00 3.00 3.00 0.16 -0.27 0.06
## GarageArea 0.00 880.00 880.00 0.40 -0.12 17.30
## GarageQual* 1.00 5.00 4.00 -1.09 -0.72 0.12
## GarageCond* 1.00 4.00 3.00 -1.98 2.08 0.09
## PavedDrive* 1.00 3.00 2.00 -0.89 -1.09 0.07
## WoodDeckSF 0.00 509.00 509.00 2.44 5.46 8.99
## OpenPorchSF 0.00 547.00 547.00 4.02 19.48 6.60
## EnclosedPorch 0.00 330.00 330.00 1.00 -0.18 7.52
## X3SsnPorch 0.00 144.00 144.00 11.84 139.04 0.99
## ScreenPorch 0.00 480.00 480.00 4.66 24.56 5.46
## PoolArea 0.00 0.00 0.00 NaN NaN 0.00
## PoolQC* Inf -Inf -Inf NA NA NA
## Fence* 1.00 4.00 3.00 -0.70 -0.76 0.15
## MiscFeature* 1.00 1.00 0.00 NaN NaN 0.00
## MiscVal 0.00 1150.00 1150.00 5.63 33.61 11.82
## MoSold 1.00 12.00 11.00 0.22 -0.33 0.21
## YrSold 2006.00 2010.00 4.00 0.05 -1.10 0.10
## SaleType* 1.00 7.00 6.00 -4.11 15.98 0.10
## SaleCondition* 1.00 6.00 5.00 -2.47 4.34 0.10
## SalePrice 37900.00 475000.00 437100.00 2.31 10.37 4518.10
## log_SalePrice 10.54 13.07 2.53 0.02 1.42 0.03
## ------------------------------------------------------------
## group: CBlock
## vars n mean sd median trimmed mad
## Id 1 634 728.26 421.05 729.50 728.01 549.30
## MSSubClass 2 634 52.68 44.35 30.00 43.11 14.83
## MSZoning* 3 634 3.10 0.42 3.00 3.04 0.00
## LotFrontage 4 494 70.80 24.56 70.00 70.38 14.83
## LotArea 5 634 11272.36 13814.10 9600.00 9727.21 2816.94
## Street* 6 634 1.99 0.10 2.00 2.00 0.00
## Alley* 7 19 1.16 0.37 1.00 1.12 0.00
## LotShape* 8 634 3.01 1.40 4.00 3.14 0.00
## LandContour* 9 634 3.79 0.69 4.00 4.00 0.00
## Utilities* 10 634 1.00 0.04 1.00 1.00 0.00
## LotConfig* 11 634 3.28 1.21 4.00 3.47 0.00
## LandSlope* 12 634 1.08 0.32 1.00 1.00 0.00
## Neighborhood* 13 634 12.22 4.94 12.00 12.37 5.93
## Condition1* 14 634 2.95 0.79 3.00 2.96 0.00
## Condition2* 15 634 3.00 0.17 3.00 3.00 0.00
## BldgType* 16 634 1.41 1.02 1.00 1.12 0.00
## HouseStyle* 17 634 3.90 1.98 3.00 3.79 0.00
## OverallQual 18 634 5.42 0.96 5.00 5.40 1.48
## OverallCond 19 634 5.83 1.19 6.00 5.80 1.48
## YearBuilt 20 634 1961.25 16.72 1963.50 1962.92 14.08
## YearRemodAdd 21 634 1975.22 17.68 1972.00 1974.56 19.27
## RoofStyle* 22 634 2.47 0.90 2.00 2.35 0.00
## RoofMatl* 23 634 1.12 0.73 1.00 1.00 0.00
## Exterior1st* 24 634 8.02 2.67 7.00 7.99 1.48
## Exterior2nd* 25 634 8.72 3.00 8.00 8.74 2.97
## MasVnrType* 26 634 2.66 0.57 3.00 2.68 0.00
## MasVnrArea 27 634 89.56 148.31 0.00 58.35 0.00
## ExterQual* 28 634 3.86 0.43 4.00 3.99 0.00
## ExterCond* 29 634 4.69 0.77 5.00 4.88 0.00
## Foundation* 30 634 1.00 0.00 1.00 1.00 0.00
## BsmtQual* 31 625 3.72 0.51 4.00 3.80 0.00
## BsmtCond* 32 625 2.90 0.39 3.00 3.00 0.00
## BsmtExposure* 33 625 3.38 1.08 4.00 3.60 0.00
## BsmtFinType1* 34 625 3.24 1.87 3.00 3.18 2.97
## BsmtFinSF1 35 634 477.12 361.70 466.50 449.41 372.13
## BsmtFinType2* 36 625 5.49 1.18 6.00 5.81 0.00
## BsmtFinSF2 37 634 80.89 202.21 0.00 24.47 0.00
## BsmtUnfSF 38 634 443.47 360.11 388.00 401.89 343.22
## TotalBsmtSF 39 634 1001.49 335.76 954.50 984.78 266.13
## Heating* 40 634 2.01 0.13 2.00 2.00 0.00
## HeatingQC* 41 634 3.47 1.70 5.00 3.58 0.00
## CentralAir* 42 634 1.95 0.23 2.00 2.00 0.00
## Electrical* 43 634 2.80 0.59 3.00 2.99 0.00
## X1stFlrSF 44 634 1121.46 345.06 1056.00 1092.94 286.88
## X2ndFlrSF 45 634 228.71 363.28 0.00 161.02 0.00
## LowQualFinSF 46 634 5.32 47.28 0.00 0.00 0.00
## GrLivArea 47 634 1355.50 460.64 1263.00 1306.17 433.66
## BsmtFullBath 48 634 0.45 0.54 0.00 0.41 0.00
## BsmtHalfBath 49 634 0.09 0.29 0.00 0.00 0.00
## FullBath 50 634 1.32 0.52 1.00 1.28 0.00
## HalfBath 51 634 0.33 0.50 0.00 0.27 0.00
## BedroomAbvGr 52 634 2.87 0.83 3.00 2.84 0.00
## KitchenAbvGr 53 634 1.06 0.24 1.00 1.00 0.00
## KitchenQual* 54 634 3.70 0.60 4.00 3.83 0.00
## TotRmsAbvGrd 55 634 6.13 1.53 6.00 6.02 1.48
## Functional* 56 634 6.66 1.10 7.00 7.00 0.00
## Fireplaces 57 634 0.59 0.69 0.00 0.48 0.00
## FireplaceQu* 58 299 3.92 1.12 4.00 4.02 1.48
## GarageType* 59 588 3.34 1.85 2.00 3.19 0.00
## GarageYrBlt 60 588 1966.81 14.97 1967.00 1967.00 14.83
## GarageFinish* 61 588 2.46 0.72 3.00 2.57 0.00
## GarageCars 62 634 1.50 0.68 2.00 1.55 0.00
## GarageArea 63 634 410.85 192.21 440.00 413.15 195.70
## GarageQual* 64 588 3.93 0.36 4.00 4.00 0.00
## GarageCond* 65 588 4.89 0.55 5.00 5.00 0.00
## PavedDrive* 66 634 2.88 0.45 3.00 3.00 0.00
## WoodDeckSF 67 634 82.46 131.25 0.00 55.75 0.00
## OpenPorchSF 68 634 33.79 63.71 0.00 18.60 0.00
## EnclosedPorch 69 634 21.32 59.30 0.00 3.33 0.00
## X3SsnPorch 70 634 3.34 28.18 0.00 0.00 0.00
## ScreenPorch 71 634 18.96 60.77 0.00 0.24 0.00
## PoolArea 72 634 3.91 49.58 0.00 0.00 0.00
## PoolQC* 73 4 1.50 0.58 1.50 1.50 0.74
## Fence* 74 190 2.49 0.83 3.00 2.56 0.00
## MiscFeature* 75 35 2.89 0.53 3.00 3.00 0.00
## MiscVal 76 634 70.97 719.89 0.00 0.00 0.00
## MoSold 77 634 6.26 2.72 6.00 6.19 2.97
## YrSold 78 634 2007.87 1.35 2008.00 2007.83 1.48
## SaleType* 79 634 7.51 1.75 8.00 8.00 0.00
## SaleCondition* 80 634 4.60 1.17 5.00 4.97 0.00
## SalePrice 81 634 149805.71 48295.04 141500.00 144623.83 33358.50
## log_SalePrice 82 634 11.87 0.31 11.86 11.87 0.24
## min max range skew kurtosis se
## Id 2.00 1460.00 1458.00 -0.01 -1.22 16.72
## MSSubClass 20.00 190.00 170.00 1.57 1.87 1.76
## MSZoning* 1.00 4.00 3.00 -0.29 7.50 0.02
## LotFrontage 21.00 313.00 292.00 2.07 19.17 1.10
## LotArea 1300.00 215245.00 213945.00 10.14 122.71 548.63
## Street* 1.00 2.00 1.00 -10.11 100.35 0.00
## Alley* 1.00 2.00 1.00 1.73 1.06 0.09
## LotShape* 1.00 4.00 3.00 -0.72 -1.47 0.06
## LandContour* 1.00 4.00 3.00 -3.33 9.93 0.03
## Utilities* 1.00 2.00 1.00 25.06 627.01 0.00
## LotConfig* 1.00 4.00 3.00 -1.18 -0.47 0.05
## LandSlope* 1.00 3.00 2.00 4.09 17.32 0.01
## Neighborhood* 1.00 23.00 22.00 -0.17 -0.58 0.20
## Condition1* 1.00 8.00 7.00 1.79 11.08 0.03
## Condition2* 1.00 6.00 5.00 8.12 197.74 0.01
## BldgType* 1.00 5.00 4.00 2.44 4.75 0.04
## HouseStyle* 1.00 8.00 7.00 0.65 -0.61 0.08
## OverallQual 2.00 10.00 8.00 0.24 1.28 0.04
## OverallCond 2.00 9.00 7.00 0.20 0.16 0.05
## YearBuilt 1875.00 2009.00 134.00 -1.18 2.39 0.66
## YearRemodAdd 1950.00 2010.00 60.00 0.31 -1.04 0.70
## RoofStyle* 1.00 6.00 5.00 1.22 -0.03 0.04
## RoofMatl* 1.00 7.00 6.00 6.28 39.15 0.03
## Exterior1st* 1.00 13.00 12.00 0.28 -0.79 0.11
## Exterior2nd* 1.00 14.00 13.00 0.03 -0.72 0.12
## MasVnrType* 1.00 4.00 3.00 -0.45 0.00 0.02
## MasVnrArea 0.00 1115.00 1115.00 2.27 7.42 5.89
## ExterQual* 1.00 4.00 3.00 -3.71 15.85 0.02
## ExterCond* 1.00 5.00 4.00 -2.17 3.25 0.03
## Foundation* 1.00 1.00 0.00 NaN NaN 0.00
## BsmtQual* 1.00 4.00 3.00 -1.74 3.04 0.02
## BsmtCond* 1.00 3.00 2.00 -4.17 16.38 0.02
## BsmtExposure* 1.00 4.00 3.00 -1.39 0.27 0.04
## BsmtFinType1* 1.00 6.00 5.00 0.21 -1.47 0.07
## BsmtFinSF1 0.00 1880.00 1880.00 0.53 0.01 14.36
## BsmtFinType2* 1.00 6.00 5.00 -2.44 5.01 0.05
## BsmtFinSF2 0.00 1474.00 1474.00 3.05 10.18 8.03
## BsmtUnfSF 0.00 1907.00 1907.00 1.13 1.40 14.30
## TotalBsmtSF 0.00 2223.00 2223.00 0.43 1.37 13.33
## Heating* 1.00 4.00 3.00 8.79 115.84 0.01
## HeatingQC* 1.00 5.00 4.00 -0.44 -1.50 0.07
## CentralAir* 1.00 2.00 1.00 -3.95 13.65 0.01
## Electrical* 1.00 3.00 2.00 -2.64 5.13 0.02
## X1stFlrSF 438.00 2898.00 2460.00 0.94 1.37 13.70
## X2ndFlrSF 0.00 1540.00 1540.00 1.27 0.36 14.43
## LowQualFinSF 0.00 528.00 528.00 9.43 91.14 1.88
## GrLivArea 438.00 3447.00 3009.00 1.11 1.62 18.29
## BsmtFullBath 0.00 3.00 3.00 0.66 -0.38 0.02
## BsmtHalfBath 0.00 2.00 2.00 3.05 8.12 0.01
## FullBath 0.00 3.00 3.00 0.84 -0.03 0.02
## HalfBath 0.00 2.00 2.00 1.09 -0.01 0.02
## BedroomAbvGr 0.00 8.00 8.00 0.50 3.90 0.03
## KitchenAbvGr 0.00 2.00 2.00 3.47 11.93 0.01
## KitchenQual* 1.00 4.00 3.00 -2.35 6.08 0.02
## TotRmsAbvGrd 3.00 14.00 11.00 1.10 2.46 0.06
## Functional* 1.00 7.00 6.00 -3.19 9.38 0.04
## Fireplaces 0.00 3.00 3.00 0.85 -0.18 0.03
## FireplaceQu* 1.00 5.00 4.00 -0.31 -1.38 0.06
## GarageType* 1.00 6.00 5.00 0.68 -1.47 0.08
## GarageYrBlt 1906.00 2009.00 103.00 -0.27 0.74 0.62
## GarageFinish* 1.00 3.00 2.00 -0.94 -0.50 0.03
## GarageCars 0.00 4.00 4.00 -0.27 0.29 0.03
## GarageArea 0.00 1356.00 1356.00 0.22 1.67 7.63
## GarageQual* 1.00 4.00 3.00 -5.56 31.09 0.01
## GarageCond* 1.00 5.00 4.00 -5.14 25.63 0.02
## PavedDrive* 1.00 3.00 2.00 -3.78 12.70 0.02
## WoodDeckSF 0.00 857.00 857.00 1.90 4.41 5.21
## OpenPorchSF 0.00 523.00 523.00 2.88 10.94 2.53
## EnclosedPorch 0.00 318.00 318.00 2.78 6.72 2.35
## X3SsnPorch 0.00 407.00 407.00 9.55 102.06 1.12
## ScreenPorch 0.00 440.00 440.00 3.43 12.08 2.41
## PoolArea 0.00 738.00 738.00 12.77 164.07 1.97
## PoolQC* 1.00 2.00 1.00 0.00 -2.44 0.29
## Fence* 1.00 4.00 3.00 -0.62 -0.58 0.06
## MiscFeature* 1.00 4.00 3.00 -2.44 6.96 0.09
## MiscVal 0.00 15500.00 15500.00 18.00 357.03 28.59
## MoSold 1.00 12.00 11.00 0.21 -0.42 0.11
## YrSold 2006.00 2010.00 4.00 0.07 -1.22 0.05
## SaleType* 1.00 8.00 7.00 -3.33 9.25 0.07
## SaleCondition* 1.00 6.00 5.00 -2.65 5.23 0.05
## SalePrice 34900.00 402861.00 367961.00 1.52 4.27 1918.04
## log_SalePrice 10.46 12.91 2.45 -0.12 2.05 0.01
## ------------------------------------------------------------
## group: PConc
## vars n mean sd median trimmed mad
## Id 1 647 727.94 418.74 732.00 727.53 529.29
## MSSubClass 2 647 60.26 41.01 60.00 54.51 59.30
## MSZoning* 3 647 2.88 0.69 3.00 2.99 0.00
## LotFrontage 4 542 71.70 25.58 70.00 70.38 19.27
## LotArea 5 647 10139.60 5585.26 9591.00 9668.80 3100.12
## Street* 6 647 1.00 0.00 1.00 1.00 0.00
## Alley* 7 34 1.79 0.41 2.00 1.86 0.00
## LotShape* 8 647 2.73 1.44 4.00 2.78 0.00
## LandContour* 9 647 3.82 0.62 4.00 4.00 0.00
## Utilities* 10 647 1.00 0.00 1.00 1.00 0.00
## LotConfig* 11 647 4.05 1.58 5.00 4.30 0.00
## LandSlope* 12 647 1.04 0.22 1.00 1.00 0.00
## Neighborhood* 13 647 10.84 5.97 12.00 10.70 8.90
## Condition1* 14 647 3.13 0.86 3.00 3.00 0.00
## Condition2* 15 647 2.00 0.07 2.00 2.00 0.00
## BldgType* 16 647 1.66 1.45 1.00 1.34 0.00
## HouseStyle* 17 647 4.44 1.71 3.00 4.44 2.97
## OverallQual 18 647 6.98 1.24 7.00 7.00 1.48
## OverallCond 19 647 5.20 0.68 5.00 5.04 0.00
## YearBuilt 20 647 1993.31 23.09 2002.00 1999.26 5.93
## YearRemodAdd 21 647 1998.05 13.59 2003.00 2001.29 5.93
## RoofStyle* 22 647 2.41 0.81 2.00 2.26 0.00
## RoofMatl* 23 647 2.01 0.20 2.00 2.00 0.00
## Exterior1st* 24 647 9.56 2.69 11.00 10.02 0.00
## Exterior2nd* 25 647 10.32 3.06 12.00 10.80 0.00
## MasVnrType* 26 639 2.81 0.70 3.00 2.77 1.48
## MasVnrArea 27 639 143.21 217.66 30.00 97.11 44.48
## ExterQual* 28 647 3.14 0.74 3.00 3.23 0.00
## ExterCond* 29 647 2.94 0.26 3.00 3.00 0.00
## Foundation* 30 647 1.00 0.00 1.00 1.00 0.00
## BsmtQual* 31 644 2.73 0.88 3.00 2.79 0.00
## BsmtCond* 32 644 2.92 0.29 3.00 3.00 0.00
## BsmtExposure* 33 643 3.02 1.25 4.00 3.15 0.00
## BsmtFinType1* 34 644 3.94 1.64 3.00 4.02 0.00
## BsmtFinSF1 35 647 492.05 544.04 405.00 425.76 600.45
## BsmtFinType2* 36 643 5.88 0.63 6.00 6.00 0.00
## BsmtFinSF2 37 647 21.32 119.74 0.00 0.00 0.00
## BsmtUnfSF 38 647 695.33 493.69 600.00 652.56 480.36
## TotalBsmtSF 39 647 1208.70 478.74 1151.00 1177.59 459.61
## Heating* 40 647 1.00 0.00 1.00 1.00 0.00
## HeatingQC* 41 647 1.46 0.97 1.00 1.23 0.00
## CentralAir* 42 647 1.99 0.11 2.00 2.00 0.00
## Electrical* 43 646 3.96 0.34 4.00 4.00 0.00
## X1stFlrSF 44 647 1248.05 428.38 1199.00 1215.09 446.26
## X2ndFlrSF 45 647 436.88 477.50 0.00 387.45 0.00
## LowQualFinSF 46 647 2.93 33.03 0.00 0.00 0.00
## GrLivArea 47 647 1687.86 520.92 1626.00 1641.11 410.68
## BsmtFullBath 48 647 0.47 0.51 0.00 0.46 0.00
## BsmtHalfBath 49 647 0.03 0.19 0.00 0.00 0.00
## FullBath 50 647 1.85 0.44 2.00 1.90 0.00
## HalfBath 51 647 0.49 0.51 0.00 0.48 0.00
## BedroomAbvGr 52 647 2.84 0.77 3.00 2.85 0.00
## KitchenAbvGr 53 647 1.01 0.12 1.00 1.00 0.00
## KitchenQual* 54 647 2.92 0.85 3.00 3.03 0.00
## TotRmsAbvGrd 55 647 6.87 1.59 7.00 6.80 1.48
## Functional* 56 647 4.94 0.43 5.00 5.00 0.00
## Fireplaces 57 647 0.69 0.59 1.00 0.66 0.00
## FireplaceQu* 58 404 3.65 1.15 3.00 3.71 0.00
## GarageType* 59 633 1.61 1.10 1.00 1.38 0.00
## GarageYrBlt 60 633 1996.24 16.97 2002.00 2000.22 5.93
## GarageFinish* 61 633 1.76 0.74 2.00 1.70 1.48
## GarageCars 62 647 2.15 0.62 2.00 2.19 0.00
## GarageArea 63 647 566.15 196.96 539.00 562.04 161.60
## GarageQual* 64 633 2.98 0.18 3.00 3.00 0.00
## GarageCond* 65 633 3.98 0.24 4.00 4.00 0.00
## PavedDrive* 66 647 2.95 0.31 3.00 3.00 0.00
## WoodDeckSF 67 647 118.43 119.68 120.00 102.27 177.91
## OpenPorchSF 68 647 63.61 61.88 48.00 54.41 47.44
## EnclosedPorch 69 647 10.24 44.99 0.00 0.00 0.00
## X3SsnPorch 70 647 3.70 31.21 0.00 0.00 0.00
## ScreenPorch 71 647 11.39 48.83 0.00 0.00 0.00
## PoolArea 72 647 2.39 35.12 0.00 0.00 0.00
## PoolQC* 73 3 1.33 0.58 1.00 1.33 0.00
## Fence* 74 51 2.18 0.91 3.00 2.22 0.00
## MiscFeature* 75 8 1.00 0.00 1.00 1.00 0.00
## MiscVal 76 647 9.40 102.85 0.00 0.00 0.00
## MoSold 77 647 6.38 2.73 6.00 6.30 2.97
## YrSold 78 647 2007.77 1.31 2008.00 2007.71 1.48
## SaleType* 79 647 8.53 1.06 9.00 8.74 0.00
## SaleCondition* 80 647 4.99 0.97 5.00 5.11 0.00
## SalePrice 81 647 225230.44 86865.98 205000.00 214800.87 62417.46
## log_SalePrice 82 647 12.26 0.35 12.23 12.25 0.31
## min max range skew kurtosis se
## Id 1.00 1456.00 1455.00 0.01 -1.19 16.46
## MSSubClass 20.00 190.00 170.00 1.08 0.55 1.61
## MSZoning* 1.00 4.00 3.00 -1.65 3.05 0.03
## LotFrontage 24.00 313.00 289.00 2.11 15.31 1.10
## LotArea 2117.00 63887.00 61770.00 3.92 27.87 219.58
## Street* 1.00 1.00 0.00 NaN NaN 0.00
## Alley* 1.00 2.00 1.00 -1.39 -0.06 0.07
## LotShape* 1.00 4.00 3.00 -0.29 -1.87 0.06
## LandContour* 1.00 4.00 3.00 -3.37 10.25 0.02
## Utilities* 1.00 1.00 0.00 NaN NaN 0.00
## LotConfig* 1.00 5.00 4.00 -1.16 -0.47 0.06
## LandSlope* 1.00 3.00 2.00 5.78 36.25 0.01
## Neighborhood* 1.00 22.00 21.00 0.11 -1.36 0.23
## Condition1* 1.00 9.00 8.00 4.22 21.10 0.03
## Condition2* 1.00 3.00 2.00 4.82 211.78 0.00
## BldgType* 1.00 5.00 4.00 1.76 1.19 0.06
## HouseStyle* 1.00 8.00 7.00 0.04 -1.29 0.07
## OverallQual 3.00 10.00 7.00 -0.15 0.12 0.05
## OverallCond 2.00 9.00 7.00 2.35 8.78 0.03
## YearBuilt 1885.00 2010.00 125.00 -2.47 5.33 0.91
## YearRemodAdd 1950.00 2010.00 60.00 -2.34 4.99 0.53
## RoofStyle* 1.00 5.00 4.00 1.45 0.20 0.03
## RoofMatl* 1.00 5.00 4.00 12.67 182.05 0.01
## Exterior1st* 1.00 13.00 12.00 -1.38 0.55 0.11
## Exterior2nd* 1.00 14.00 13.00 -1.25 0.09 0.12
## MasVnrType* 1.00 4.00 3.00 0.26 -0.93 0.03
## MasVnrArea 0.00 1600.00 1600.00 2.32 7.19 8.61
## ExterQual* 1.00 4.00 3.00 -1.27 2.44 0.03
## ExterCond* 1.00 3.00 2.00 -4.25 18.90 0.01
## Foundation* 1.00 1.00 0.00 NaN NaN 0.00
## BsmtQual* 1.00 4.00 3.00 -1.05 0.19 0.03
## BsmtCond* 1.00 3.00 2.00 -3.94 16.16 0.01
## BsmtExposure* 1.00 4.00 3.00 -0.70 -1.26 0.05
## BsmtFinType1* 1.00 6.00 5.00 0.15 -1.28 0.06
## BsmtFinSF1 0.00 5644.00 5644.00 1.82 11.25 21.39
## BsmtFinType2* 1.00 6.00 5.00 -5.98 37.13 0.02
## BsmtFinSF2 0.00 1127.00 1127.00 6.66 47.60 4.71
## BsmtUnfSF 0.00 2336.00 2336.00 0.70 -0.29 19.41
## TotalBsmtSF 0.00 6110.00 6110.00 2.21 17.29 18.82
## Heating* 1.00 1.00 0.00 NaN NaN 0.00
## HeatingQC* 1.00 4.00 3.00 1.76 1.41 0.04
## CentralAir* 1.00 2.00 1.00 -8.80 75.64 0.00
## Electrical* 1.00 4.00 3.00 -8.01 63.75 0.01
## X1stFlrSF 520.00 4692.00 4172.00 1.46 6.77 16.84
## X2ndFlrSF 0.00 2065.00 2065.00 0.48 -1.08 18.77
## LowQualFinSF 0.00 481.00 481.00 12.34 155.17 1.30
## GrLivArea 672.00 5642.00 4970.00 1.86 8.46 20.48
## BsmtFullBath 0.00 2.00 2.00 0.28 -1.52 0.02
## BsmtHalfBath 0.00 2.00 2.00 5.79 35.60 0.01
## FullBath 0.00 3.00 3.00 -0.93 1.81 0.02
## HalfBath 0.00 2.00 2.00 0.12 -1.81 0.02
## BedroomAbvGr 0.00 6.00 6.00 -0.26 1.10 0.03
## KitchenAbvGr 1.00 3.00 2.00 11.10 137.25 0.00
## KitchenQual* 1.00 4.00 3.00 -1.13 0.94 0.03
## TotRmsAbvGrd 3.00 12.00 9.00 0.39 0.39 0.06
## Functional* 1.00 5.00 4.00 -8.10 66.90 0.02
## Fireplaces 0.00 3.00 3.00 0.30 -0.14 0.02
## FireplaceQu* 1.00 5.00 4.00 -0.14 -0.75 0.06
## GarageType* 1.00 4.00 3.00 1.38 0.14 0.04
## GarageYrBlt 1910.00 2010.00 100.00 -2.70 7.67 0.67
## GarageFinish* 1.00 3.00 2.00 0.41 -1.07 0.03
## GarageCars 0.00 4.00 4.00 -0.61 1.85 0.02
## GarageArea 0.00 1418.00 1418.00 0.23 1.46 7.74
## GarageQual* 1.00 3.00 2.00 -9.50 93.90 0.01
## GarageCond* 1.00 4.00 3.00 -11.26 129.48 0.01
## PavedDrive* 1.00 3.00 2.00 -5.89 33.68 0.01
## WoodDeckSF 0.00 668.00 668.00 1.08 1.59 4.71
## OpenPorchSF 0.00 406.00 406.00 1.50 2.97 2.43
## EnclosedPorch 0.00 552.00 552.00 5.71 43.05 1.77
## X3SsnPorch 0.00 508.00 508.00 10.57 132.73 1.23
## ScreenPorch 0.00 396.00 396.00 4.54 21.31 1.92
## PoolArea 0.00 555.00 555.00 14.63 213.08 1.38
## PoolQC* 1.00 2.00 1.00 0.38 -2.33 0.33
## Fence* 1.00 3.00 2.00 -0.34 -1.73 0.13
## MiscFeature* 1.00 1.00 0.00 NaN NaN 0.00
## MiscVal 0.00 2000.00 2000.00 14.64 247.23 4.04
## MoSold 1.00 12.00 11.00 0.21 -0.45 0.11
## YrSold 2006.00 2010.00 4.00 0.15 -1.16 0.05
## SaleType* 1.00 9.00 8.00 -3.12 13.83 0.04
## SaleCondition* 1.00 6.00 5.00 -2.95 10.06 0.04
## SalePrice 78000.00 755000.00 677000.00 1.79 5.73 3415.05
## log_SalePrice 11.26 13.53 2.27 0.28 0.58 0.01
## ------------------------------------------------------------
## group: Slab
## vars n mean sd median trimmed mad min
## Id 1 24 781.67 400.82 882.00 800.25 469.24 18.00
## MSSubClass 2 24 63.12 42.16 72.50 59.75 29.65 20.00
## MSZoning* 3 24 2.08 0.41 2.00 2.05 0.00 1.00
## LotFrontage 4 19 65.21 11.32 64.00 64.06 5.93 50.00
## LotArea 5 24 9117.62 3554.16 8369.50 8585.55 2003.73 5000.00
## Street* 6 24 1.00 0.00 1.00 1.00 0.00 1.00
## Alley* 7 0 NaN NA NA NaN NA Inf
## LotShape* 8 24 2.71 0.69 3.00 2.85 0.00 1.00
## LandContour* 9 24 2.79 0.59 3.00 2.95 0.00 1.00
## Utilities* 10 24 1.00 0.00 1.00 1.00 0.00 1.00
## LotConfig* 11 24 3.17 1.27 4.00 3.30 0.00 1.00
## LandSlope* 12 24 1.04 0.20 1.00 1.00 0.00 1.00
## Neighborhood* 13 24 4.75 2.47 6.00 4.80 2.97 1.00
## Condition1* 14 24 1.92 0.41 2.00 1.95 0.00 1.00
## Condition2* 15 24 1.00 0.00 1.00 1.00 0.00 1.00
## BldgType* 16 24 1.88 0.99 1.00 1.85 0.00 1.00
## HouseStyle* 17 24 2.08 0.65 2.00 2.05 0.00 1.00
## OverallQual 18 24 4.29 1.20 4.00 4.30 1.48 1.00
## OverallCond 19 24 4.75 1.07 5.00 4.70 0.00 3.00
## YearBuilt 20 24 1959.58 15.92 1955.00 1958.65 10.38 1930.00
## YearRemodAdd 21 24 1965.17 19.02 1956.00 1962.55 8.90 1950.00
## RoofStyle* 22 24 2.08 0.41 2.00 2.05 0.00 1.00
## RoofMatl* 23 24 1.04 0.20 1.00 1.00 0.00 1.00
## Exterior1st* 24 24 5.50 2.47 5.50 5.50 3.71 1.00
## Exterior2nd* 25 24 6.67 2.97 6.00 6.75 3.71 1.00
## MasVnrType* 26 24 1.83 0.38 2.00 1.90 0.00 1.00
## MasVnrArea 27 24 51.46 133.53 0.00 19.75 0.00 0.00
## ExterQual* 28 24 2.79 0.59 3.00 2.95 0.00 1.00
## ExterCond* 29 24 2.62 0.77 3.00 2.75 0.00 1.00
## Foundation* 30 24 1.00 0.00 1.00 1.00 0.00 1.00
## BsmtQual* 31 0 NaN NA NA NaN NA Inf
## BsmtCond* 32 0 NaN NA NA NaN NA Inf
## BsmtExposure* 33 0 NaN NA NA NaN NA Inf
## BsmtFinType1* 34 0 NaN NA NA NaN NA Inf
## BsmtFinSF1 35 24 0.00 0.00 0.00 0.00 0.00 0.00
## BsmtFinType2* 36 0 NaN NA NA NaN NA Inf
## BsmtFinSF2 37 24 0.00 0.00 0.00 0.00 0.00 0.00
## BsmtUnfSF 38 24 0.00 0.00 0.00 0.00 0.00 0.00
## TotalBsmtSF 39 24 0.00 0.00 0.00 0.00 0.00 0.00
## Heating* 40 24 1.38 0.77 1.00 1.25 0.00 1.00
## HeatingQC* 41 24 2.88 1.19 3.00 2.95 1.48 1.00
## CentralAir* 42 24 1.62 0.49 2.00 1.65 0.00 1.00
## Electrical* 43 24 2.50 0.78 3.00 2.60 0.00 1.00
## X1stFlrSF 44 24 1118.42 430.45 1064.00 1117.25 347.67 334.00
## X2ndFlrSF 45 24 218.83 404.93 0.00 135.25 0.00 0.00
## LowQualFinSF 46 24 2.21 10.82 0.00 0.00 0.00 0.00
## GrLivArea 47 24 1339.46 506.17 1174.00 1321.30 501.12 334.00
## BsmtFullBath 48 24 0.00 0.00 0.00 0.00 0.00 0.00
## BsmtHalfBath 49 24 0.00 0.00 0.00 0.00 0.00 0.00
## FullBath 50 24 1.67 0.56 2.00 1.65 0.00 1.00
## HalfBath 51 24 0.00 0.00 0.00 0.00 0.00 0.00
## BedroomAbvGr 52 24 2.92 1.14 3.00 2.85 1.48 1.00
## KitchenAbvGr 53 24 1.46 0.51 1.00 1.45 0.00 1.00
## KitchenQual* 54 24 2.71 0.69 3.00 2.85 0.00 1.00
## TotRmsAbvGrd 55 24 6.50 2.28 6.00 6.45 2.22 2.00
## Functional* 56 24 3.46 0.98 4.00 3.65 0.00 1.00
## Fireplaces 57 24 0.33 0.56 0.00 0.25 0.00 0.00
## FireplaceQu* 58 7 2.86 1.21 3.00 2.86 1.48 1.00
## GarageType* 59 20 2.75 1.41 3.50 2.81 0.74 1.00
## GarageYrBlt 60 20 1967.20 14.89 1964.50 1966.25 19.27 1945.00
## GarageFinish* 61 20 1.90 0.31 2.00 2.00 0.00 1.00
## GarageCars 62 24 1.50 0.78 2.00 1.60 0.00 0.00
## GarageArea 63 24 375.04 203.45 405.00 382.85 167.53 0.00
## GarageQual* 64 20 1.00 0.00 1.00 1.00 0.00 1.00
## GarageCond* 65 20 1.95 0.22 2.00 2.00 0.00 1.00
## PavedDrive* 66 24 2.46 0.88 3.00 2.55 0.00 1.00
## WoodDeckSF 67 24 23.00 54.57 0.00 10.60 0.00 0.00
## OpenPorchSF 68 24 8.71 30.25 0.00 1.45 0.00 0.00
## EnclosedPorch 69 24 23.00 62.19 0.00 8.85 0.00 0.00
## X3SsnPorch 70 24 0.00 0.00 0.00 0.00 0.00 0.00
## ScreenPorch 71 24 0.00 0.00 0.00 0.00 0.00 0.00
## PoolArea 72 24 0.00 0.00 0.00 0.00 0.00 0.00
## PoolQC* 73 0 NaN NA NA NaN NA Inf
## Fence* 74 2 1.50 0.71 1.50 1.50 0.74 1.00
## MiscFeature* 75 3 1.67 0.58 2.00 1.67 0.00 1.00
## MiscVal 76 24 216.67 746.39 0.00 25.00 0.00 0.00
## MoSold 77 24 5.83 2.46 6.00 5.85 1.48 1.00
## YrSold 78 24 2008.04 1.49 2009.00 2008.05 1.48 2006.00
## SaleType* 79 24 1.96 0.20 2.00 2.00 0.00 1.00
## SaleCondition* 80 24 1.88 0.34 2.00 1.95 0.00 1.00
## SalePrice 81 24 107365.62 34213.98 104150.00 105748.75 21884.66 39300.00
## log_SalePrice 82 24 11.53 0.34 11.55 11.55 0.22 10.58
## max range skew kurtosis se
## Id 1413.0 1395.00 -0.43 -1.06 81.82
## MSSubClass 190.0 170.00 0.85 0.90 8.61
## MSZoning* 3.0 2.00 0.63 2.24 0.08
## LotFrontage 100.0 50.00 1.27 2.25 2.60
## LotArea 21750.0 16750.00 1.96 4.20 725.49
## Street* 1.0 0.00 NaN NaN 0.00
## Alley* -Inf -Inf NA NA NA
## LotShape* 3.0 2.00 -1.88 1.76 0.14
## LandContour* 3.0 2.00 -2.42 4.32 0.12
## Utilities* 1.0 0.00 NaN NaN 0.00
## LotConfig* 4.0 3.00 -0.90 -1.08 0.26
## LandSlope* 2.0 1.00 4.30 17.24 0.04
## Neighborhood* 8.0 7.00 -0.17 -1.59 0.50
## Condition1* 3.0 2.00 -0.63 2.24 0.08
## Condition2* 1.0 0.00 NaN NaN 0.00
## BldgType* 3.0 2.00 0.24 -1.98 0.20
## HouseStyle* 4.0 3.00 0.82 1.50 0.13
## OverallQual 7.0 6.00 -0.40 0.92 0.24
## OverallCond 7.0 4.00 0.08 -0.11 0.22
## YearBuilt 2003.0 73.00 0.78 0.32 3.25
## YearRemodAdd 2007.0 57.00 1.03 -0.26 3.88
## RoofStyle* 3.0 2.00 0.63 2.24 0.08
## RoofMatl* 2.0 1.00 4.30 17.24 0.04
## Exterior1st* 10.0 9.00 0.09 -1.19 0.50
## Exterior2nd* 11.0 10.00 -0.11 -1.30 0.61
## MasVnrType* 2.0 1.00 -1.68 0.86 0.08
## MasVnrArea 500.0 500.00 2.29 3.91 27.26
## ExterQual* 3.0 2.00 -2.42 4.32 0.12
## ExterCond* 3.0 2.00 -1.50 0.37 0.16
## Foundation* 1.0 0.00 NaN NaN 0.00
## BsmtQual* -Inf -Inf NA NA NA
## BsmtCond* -Inf -Inf NA NA NA
## BsmtExposure* -Inf -Inf NA NA NA
## BsmtFinType1* -Inf -Inf NA NA NA
## BsmtFinSF1 0.0 0.00 NaN NaN 0.00
## BsmtFinType2* -Inf -Inf NA NA NA
## BsmtFinSF2 0.0 0.00 NaN NaN 0.00
## BsmtUnfSF 0.0 0.00 NaN NaN 0.00
## TotalBsmtSF 0.0 0.00 NaN NaN 0.00
## Heating* 3.0 2.00 1.50 0.37 0.16
## HeatingQC* 4.0 3.00 -0.36 -1.54 0.24
## CentralAir* 2.0 1.00 -0.48 -1.84 0.10
## Electrical* 3.0 2.00 -1.05 -0.58 0.16
## X1stFlrSF 2020.0 1686.00 0.10 -0.60 87.87
## X2ndFlrSF 1427.0 1427.00 1.65 1.61 82.66
## LowQualFinSF 53.0 53.00 4.30 17.24 2.21
## GrLivArea 2320.0 1986.00 0.28 -0.86 103.32
## BsmtFullBath 0.0 0.00 NaN NaN 0.00
## BsmtHalfBath 0.0 0.00 NaN NaN 0.00
## FullBath 3.0 2.00 0.05 -0.91 0.12
## HalfBath 0.0 0.00 NaN NaN 0.00
## BedroomAbvGr 6.0 5.00 0.66 -0.01 0.23
## KitchenAbvGr 2.0 1.00 0.16 -2.06 0.10
## KitchenQual* 3.0 2.00 -1.88 1.76 0.14
## TotRmsAbvGrd 12.0 10.00 0.26 -0.21 0.47
## Functional* 4.0 3.00 -1.50 0.83 0.20
## Fireplaces 2.0 2.00 1.34 0.73 0.12
## FireplaceQu* 4.0 3.00 -0.25 -1.81 0.46
## GarageType* 4.0 3.00 -0.33 -1.86 0.32
## GarageYrBlt 2003.0 58.00 0.51 -0.55 3.33
## GarageFinish* 2.0 1.00 -2.47 4.32 0.07
## GarageCars 2.0 2.00 -1.05 -0.58 0.16
## GarageArea 672.0 672.00 -0.64 -0.53 41.53
## GarageQual* 1.0 0.00 NaN NaN 0.00
## GarageCond* 2.0 1.00 -3.82 13.29 0.05
## PavedDrive* 3.0 2.00 -0.97 -1.04 0.18
## WoodDeckSF 186.0 186.00 1.94 2.25 11.14
## OpenPorchSF 144.0 144.00 3.75 13.70 6.18
## EnclosedPorch 190.0 190.00 2.13 2.67 12.69
## X3SsnPorch 0.0 0.00 NaN NaN 0.00
## ScreenPorch 0.0 0.00 NaN NaN 0.00
## PoolArea 0.0 0.00 NaN NaN 0.00
## PoolQC* -Inf -Inf NA NA NA
## Fence* 2.0 1.00 0.00 -2.75 0.50
## MiscFeature* 2.0 1.00 -0.38 -2.33 0.33
## MiscVal 3500.0 3500.00 3.62 12.73 152.36
## MoSold 11.0 10.00 0.05 -0.13 0.50
## YrSold 2010.0 4.00 -0.30 -1.62 0.30
## SaleType* 2.0 1.00 -4.30 17.24 0.04
## SaleCondition* 2.0 1.00 -2.13 2.64 0.07
## SalePrice 198500.0 159200.00 0.61 0.61 6983.90
## log_SalePrice 12.2 1.62 -0.65 1.13 0.07
## ------------------------------------------------------------
## group: Stone
## vars n mean sd median trimmed mad min
## Id 1 6 888.50 436.03 810.50 888.50 430.70 247.00
## MSSubClass 2 6 78.33 58.11 70.00 78.33 14.83 20.00
## MSZoning* 3 6 2.33 0.82 2.50 2.33 0.74 1.00
## LotFrontage 4 6 66.67 4.63 66.00 66.67 2.97 60.00
## LotArea 5 6 9014.67 1622.67 8967.00 9014.67 318.76 6600.00
## Street* 6 6 1.00 0.00 1.00 1.00 0.00 1.00
## Alley* 7 3 1.67 0.58 2.00 1.67 0.00 1.00
## LotShape* 8 6 1.83 0.41 2.00 1.83 0.00 1.00
## LandContour* 9 6 1.83 0.41 2.00 1.83 0.00 1.00
## Utilities* 10 6 1.00 0.00 1.00 1.00 0.00 1.00
## LotConfig* 11 6 1.50 0.55 1.50 1.50 0.74 1.00
## LandSlope* 12 6 1.17 0.41 1.00 1.17 0.00 1.00
## Neighborhood* 13 6 3.00 1.26 3.50 3.00 0.74 1.00
## Condition1* 14 6 1.00 0.00 1.00 1.00 0.00 1.00
## Condition2* 15 6 1.00 0.00 1.00 1.00 0.00 1.00
## BldgType* 16 6 1.17 0.41 1.00 1.17 0.00 1.00
## HouseStyle* 17 6 2.50 0.84 3.00 2.50 0.00 1.00
## OverallQual 18 6 5.67 1.21 5.50 5.67 1.48 4.00
## OverallCond 19 6 7.00 1.67 7.00 7.00 0.74 4.00
## YearBuilt 20 6 1912.67 28.61 1905.00 1912.67 28.17 1880.00
## YearRemodAdd 21 6 1978.33 26.34 1980.50 1978.33 35.58 1950.00
## RoofStyle* 22 6 1.17 0.41 1.00 1.17 0.00 1.00
## RoofMatl* 23 6 1.00 0.00 1.00 1.00 0.00 1.00
## Exterior1st* 24 6 3.50 1.87 3.50 3.50 2.22 1.00
## Exterior2nd* 25 6 3.50 1.87 3.50 3.50 2.22 1.00
## MasVnrType* 26 6 1.00 0.00 1.00 1.00 0.00 1.00
## MasVnrArea 27 6 0.00 0.00 0.00 0.00 0.00 0.00
## ExterQual* 28 6 2.33 0.82 2.50 2.33 0.74 1.00
## ExterCond* 29 6 2.50 0.84 3.00 2.50 0.00 1.00
## Foundation* 30 6 1.00 0.00 1.00 1.00 0.00 1.00
## BsmtQual* 31 6 1.83 0.41 2.00 1.83 0.00 1.00
## BsmtCond* 32 6 2.50 0.84 3.00 2.50 0.00 1.00
## BsmtExposure* 33 6 2.50 0.84 3.00 2.50 0.00 1.00
## BsmtFinType1* 34 6 1.83 0.41 2.00 1.83 0.00 1.00
## BsmtFinSF1 35 6 45.83 112.27 0.00 45.83 0.00 0.00
## BsmtFinType2* 36 6 1.00 0.00 1.00 1.00 0.00 1.00
## BsmtFinSF2 37 6 0.00 0.00 0.00 0.00 0.00 0.00
## BsmtUnfSF 38 6 849.17 389.25 935.50 849.17 119.35 105.00
## TotalBsmtSF 39 6 895.00 408.88 1007.00 895.00 217.20 105.00
## Heating* 40 6 1.17 0.41 1.00 1.17 0.00 1.00
## HeatingQC* 41 6 2.17 0.75 2.00 2.17 0.74 1.00
## CentralAir* 42 6 1.50 0.55 1.50 1.50 0.74 1.00
## Electrical* 43 6 1.83 0.41 2.00 1.83 0.00 1.00
## X1stFlrSF 44 6 1093.83 229.89 1049.00 1093.83 245.37 859.00
## X2ndFlrSF 45 6 800.83 519.94 1007.00 800.83 339.52 0.00
## LowQualFinSF 46 6 0.00 0.00 0.00 0.00 0.00 0.00
## GrLivArea 47 6 1894.67 702.28 2134.00 1894.67 551.53 910.00
## BsmtFullBath 48 6 0.00 0.00 0.00 0.00 0.00 0.00
## BsmtHalfBath 49 6 0.00 0.00 0.00 0.00 0.00 0.00
## FullBath 50 6 1.50 0.55 1.50 1.50 0.74 1.00
## HalfBath 51 6 0.17 0.41 0.00 0.17 0.00 0.00
## BedroomAbvGr 52 6 3.50 0.84 4.00 3.50 0.00 2.00
## KitchenAbvGr 53 6 1.33 0.52 1.00 1.33 0.00 1.00
## KitchenQual* 54 6 2.17 0.75 2.00 2.17 0.74 1.00
## TotRmsAbvGrd 55 6 8.17 2.04 8.50 8.17 1.48 5.00
## Functional* 56 6 1.83 0.41 2.00 1.83 0.00 1.00
## Fireplaces 57 6 0.50 0.84 0.00 0.50 0.00 0.00
## FireplaceQu* 58 2 1.00 0.00 1.00 1.00 0.00 1.00
## GarageType* 59 6 1.50 0.55 1.50 1.50 0.74 1.00
## GarageYrBlt 60 6 1950.50 24.94 1951.50 1950.50 17.05 1910.00
## GarageFinish* 61 6 1.50 0.55 1.50 1.50 0.74 1.00
## GarageCars 62 6 1.67 1.21 1.00 1.67 0.00 1.00
## GarageArea 63 6 464.33 207.58 423.00 464.33 41.51 252.00
## GarageQual* 64 6 1.83 0.41 2.00 1.83 0.00 1.00
## GarageCond* 65 6 1.83 0.41 2.00 1.83 0.00 1.00
## PavedDrive* 66 6 1.67 0.52 2.00 1.67 0.00 1.00
## WoodDeckSF 67 6 74.17 92.52 34.00 74.17 50.41 0.00
## OpenPorchSF 68 6 67.83 111.32 30.00 67.83 44.48 0.00
## EnclosedPorch 69 6 124.33 142.05 105.00 124.33 111.19 0.00
## X3SsnPorch 70 6 0.00 0.00 0.00 0.00 0.00 0.00
## ScreenPorch 71 6 0.00 0.00 0.00 0.00 0.00 0.00
## PoolArea 72 6 0.00 0.00 0.00 0.00 0.00 0.00
## PoolQC* 73 0 NaN NA NA NaN NA Inf
## Fence* 74 2 1.50 0.71 1.50 1.50 0.74 1.00
## MiscFeature* 75 1 1.00 NA 1.00 1.00 0.00 1.00
## MiscVal 76 6 416.67 1020.62 0.00 416.67 0.00 0.00
## MoSold 77 6 6.17 4.07 5.00 6.17 3.71 1.00
## YrSold 78 6 2008.67 1.51 2009.00 2008.67 1.48 2006.00
## SaleType* 79 6 1.00 0.00 1.00 1.00 0.00 1.00
## SaleCondition* 80 6 1.83 0.41 2.00 1.83 0.00 1.00
## SalePrice 81 6 165959.17 78557.70 126500.00 165959.17 31671.30 102776.00
## log_SalePrice 82 6 11.93 0.44 11.74 11.93 0.27 11.54
## max range skew kurtosis se
## Id 1458.00 1211.00 -0.04 -1.60 178.01
## MSSubClass 190.00 170.00 0.99 -0.55 23.72
## MSZoning* 3.00 2.00 -0.48 -1.58 0.33
## LotFrontage 74.00 14.00 0.18 -1.23 1.89
## LotArea 11700.00 5100.00 0.21 -0.93 662.45
## Street* 1.00 0.00 NaN NaN 0.00
## Alley* 2.00 1.00 -0.38 -2.33 0.33
## LotShape* 2.00 1.00 -1.36 -0.08 0.17
## LandContour* 2.00 1.00 -1.36 -0.08 0.17
## Utilities* 1.00 0.00 NaN NaN 0.00
## LotConfig* 2.00 1.00 0.00 -2.31 0.22
## LandSlope* 2.00 1.00 1.36 -0.08 0.17
## Neighborhood* 4.00 3.00 -0.49 -1.70 0.52
## Condition1* 1.00 0.00 NaN NaN 0.00
## Condition2* 1.00 0.00 NaN NaN 0.00
## BldgType* 2.00 1.00 1.36 -0.08 0.17
## HouseStyle* 3.00 2.00 -0.85 -1.17 0.34
## OverallQual 7.00 3.00 -0.04 -1.88 0.49
## OverallCond 9.00 5.00 -0.64 -0.92 0.68
## YearBuilt 1953.00 73.00 0.30 -1.85 11.68
## YearRemodAdd 2006.00 56.00 -0.06 -2.18 10.75
## RoofStyle* 2.00 1.00 1.36 -0.08 0.17
## RoofMatl* 1.00 0.00 NaN NaN 0.00
## Exterior1st* 6.00 5.00 0.00 -1.80 0.76
## Exterior2nd* 6.00 5.00 0.00 -1.80 0.76
## MasVnrType* 1.00 0.00 NaN NaN 0.00
## MasVnrArea 0.00 0.00 NaN NaN 0.00
## ExterQual* 3.00 2.00 -0.48 -1.58 0.33
## ExterCond* 3.00 2.00 -0.85 -1.17 0.34
## Foundation* 1.00 0.00 NaN NaN 0.00
## BsmtQual* 2.00 1.00 -1.36 -0.08 0.17
## BsmtCond* 3.00 2.00 -0.85 -1.17 0.34
## BsmtExposure* 3.00 2.00 -0.85 -1.17 0.34
## BsmtFinType1* 2.00 1.00 -1.36 -0.08 0.17
## BsmtFinSF1 275.00 275.00 1.36 -0.08 45.83
## BsmtFinType2* 1.00 0.00 NaN NaN 0.00
## BsmtFinSF2 0.00 0.00 NaN NaN 0.00
## BsmtUnfSF 1240.00 1135.00 -0.97 -0.59 158.91
## TotalBsmtSF 1240.00 1135.00 -1.05 -0.56 166.92
## Heating* 2.00 1.00 1.36 -0.08 0.17
## HeatingQC* 3.00 2.00 -0.17 -1.54 0.31
## CentralAir* 2.00 1.00 0.00 -2.31 0.22
## Electrical* 2.00 1.00 -1.36 -0.08 0.17
## X1stFlrSF 1378.00 519.00 0.13 -2.13 93.85
## X2ndFlrSF 1320.00 1320.00 -0.50 -1.73 212.27
## LowQualFinSF 0.00 0.00 NaN NaN 0.00
## GrLivArea 2640.00 1730.00 -0.34 -1.90 286.70
## BsmtFullBath 0.00 0.00 NaN NaN 0.00
## BsmtHalfBath 0.00 0.00 NaN NaN 0.00
## FullBath 2.00 1.00 0.00 -2.31 0.22
## HalfBath 1.00 1.00 1.36 -0.08 0.17
## BedroomAbvGr 4.00 2.00 -0.85 -1.17 0.34
## KitchenAbvGr 2.00 1.00 0.54 -1.96 0.21
## KitchenQual* 3.00 2.00 -0.17 -1.54 0.31
## TotRmsAbvGrd 11.00 6.00 -0.19 -1.39 0.83
## Functional* 2.00 1.00 -1.36 -0.08 0.17
## Fireplaces 2.00 2.00 0.85 -1.17 0.34
## FireplaceQu* 1.00 0.00 NaN NaN 0.00
## GarageType* 2.00 1.00 0.00 -2.31 0.22
## GarageYrBlt 1985.00 75.00 -0.26 -1.21 10.18
## GarageFinish* 2.00 1.00 0.00 -2.31 0.22
## GarageCars 4.00 3.00 1.08 -0.64 0.49
## GarageArea 864.00 612.00 1.00 -0.52 84.74
## GarageQual* 2.00 1.00 -1.36 -0.08 0.17
## GarageCond* 2.00 1.00 -1.36 -0.08 0.17
## PavedDrive* 2.00 1.00 -0.54 -1.96 0.21
## WoodDeckSF 196.00 196.00 0.38 -2.00 37.77
## OpenPorchSF 287.00 287.00 1.16 -0.43 45.45
## EnclosedPorch 386.00 386.00 0.82 -0.88 57.99
## X3SsnPorch 0.00 0.00 NaN NaN 0.00
## ScreenPorch 0.00 0.00 NaN NaN 0.00
## PoolArea 0.00 0.00 NaN NaN 0.00
## PoolQC* -Inf -Inf NA NA NA
## Fence* 2.00 1.00 0.00 -2.75 0.50
## MiscFeature* 1.00 0.00 NA NA NA
## MiscVal 2500.00 2500.00 1.36 -0.08 416.67
## MoSold 12.00 11.00 0.26 -1.72 1.66
## YrSold 2010.00 4.00 -0.71 -1.15 0.61
## SaleType* 1.00 0.00 NaN NaN 0.00
## SaleCondition* 2.00 1.00 -1.36 -0.08 0.17
## SalePrice 266500.00 163724.00 0.49 -1.96 32071.05
## log_SalePrice 12.49 0.95 0.43 -1.97 0.18
## ------------------------------------------------------------
## group: Wood
## vars n mean sd median trimmed mad min
## Id 1 3 799.67 687.51 1181.00 799.67 45.96 6.00
## MSSubClass 2 3 53.33 5.77 50.00 53.33 0.00 50.00
## MSZoning* 3 3 1.00 0.00 1.00 1.00 0.00 1.00
## LotFrontage 4 2 118.50 47.38 118.50 118.50 49.67 85.00
## LotArea 5 3 12473.00 1501.48 12134.00 12473.00 1429.23 11170.00
## Street* 6 3 1.00 0.00 1.00 1.00 0.00 1.00
## Alley* 7 0 NaN NA NA NaN NA Inf
## LotShape* 8 3 1.33 0.58 1.00 1.33 0.00 1.00
## LandContour* 9 3 1.67 0.58 2.00 1.67 0.00 1.00
## Utilities* 10 3 1.00 0.00 1.00 1.00 0.00 1.00
## LotConfig* 11 3 1.67 0.58 2.00 1.67 0.00 1.00
## LandSlope* 12 3 1.33 0.58 1.00 1.33 0.00 1.00
## Neighborhood* 13 3 2.00 1.00 2.00 2.00 1.48 1.00
## Condition1* 14 3 1.00 0.00 1.00 1.00 0.00 1.00
## Condition2* 15 3 1.00 0.00 1.00 1.00 0.00 1.00
## BldgType* 16 3 1.00 0.00 1.00 1.00 0.00 1.00
## HouseStyle* 17 3 1.33 0.58 1.00 1.33 0.00 1.00
## OverallQual 18 3 6.67 1.53 7.00 6.67 1.48 5.00
## OverallCond 19 3 5.67 1.15 5.00 5.67 0.00 5.00
## YearBuilt 20 3 1990.33 2.52 1990.00 1990.33 2.97 1988.00
## YearRemodAdd 21 3 1997.00 7.21 1995.00 1997.00 5.93 1991.00
## RoofStyle* 22 3 1.00 0.00 1.00 1.00 0.00 1.00
## RoofMatl* 23 3 1.00 0.00 1.00 1.00 0.00 1.00
## Exterior1st* 24 3 2.00 1.00 2.00 2.00 1.48 1.00
## Exterior2nd* 25 3 2.00 1.00 2.00 2.00 1.48 1.00
## MasVnrType* 26 3 1.00 0.00 1.00 1.00 0.00 1.00
## MasVnrArea 27 3 0.00 0.00 0.00 0.00 0.00 0.00
## ExterQual* 28 3 1.67 0.58 2.00 1.67 0.00 1.00
## ExterCond* 29 3 1.00 0.00 1.00 1.00 0.00 1.00
## Foundation* 30 3 1.00 0.00 1.00 1.00 0.00 1.00
## BsmtQual* 31 3 1.00 0.00 1.00 1.00 0.00 1.00
## BsmtCond* 32 3 1.00 0.00 1.00 1.00 0.00 1.00
## BsmtExposure* 33 3 1.67 0.58 2.00 1.67 0.00 1.00
## BsmtFinType1* 34 3 1.33 0.58 1.00 1.33 0.00 1.00
## BsmtFinSF1 35 3 791.67 397.87 732.00 791.67 452.19 427.00
## BsmtFinType2* 36 3 1.00 0.00 1.00 1.00 0.00 1.00
## BsmtFinSF2 37 3 0.00 0.00 0.00 0.00 0.00 0.00
## BsmtUnfSF 38 3 65.33 66.01 64.00 65.33 94.89 0.00
## TotalBsmtSF 39 3 857.00 332.72 796.00 857.00 351.38 559.00
## Heating* 40 3 1.00 0.00 1.00 1.00 0.00 1.00
## HeatingQC* 41 3 1.33 0.58 1.00 1.33 0.00 1.00
## CentralAir* 42 3 1.00 0.00 1.00 1.00 0.00 1.00
## Electrical* 43 3 1.00 0.00 1.00 1.00 0.00 1.00
## X1stFlrSF 44 3 1058.00 251.72 1080.00 1058.00 323.21 796.00
## X2ndFlrSF 45 3 818.00 348.73 672.00 818.00 157.16 566.00
## LowQualFinSF 46 3 0.00 0.00 0.00 0.00 0.00 0.00
## GrLivArea 47 3 1876.00 585.92 1752.00 1876.00 578.21 1362.00
## BsmtFullBath 48 3 0.33 0.58 0.00 0.33 0.00 0.00
## BsmtHalfBath 49 3 0.00 0.00 0.00 0.00 0.00 0.00
## FullBath 50 3 1.67 0.58 2.00 1.67 0.00 1.00
## HalfBath 51 3 0.67 0.58 1.00 0.67 0.00 0.00
## BedroomAbvGr 52 3 3.00 1.73 4.00 3.00 0.00 1.00
## KitchenAbvGr 53 3 1.00 0.00 1.00 1.00 0.00 1.00
## KitchenQual* 54 3 1.00 0.00 1.00 1.00 0.00 1.00
## TotRmsAbvGrd 55 3 7.00 1.73 8.00 7.00 0.00 5.00
## Functional* 56 3 1.00 0.00 1.00 1.00 0.00 1.00
## Fireplaces 57 3 0.00 0.00 0.00 0.00 0.00 0.00
## FireplaceQu* 58 0 NaN NA NA NaN NA Inf
## GarageType* 59 3 1.33 0.58 1.00 1.33 0.00 1.00
## GarageYrBlt 60 3 1990.33 2.52 1990.00 1990.33 2.97 1988.00
## GarageFinish* 61 3 2.00 1.00 2.00 2.00 1.48 1.00
## GarageCars 62 3 2.00 0.00 2.00 2.00 0.00 2.00
## GarageArea 63 3 555.00 119.66 492.00 555.00 17.79 480.00
## GarageQual* 64 3 1.00 0.00 1.00 1.00 0.00 1.00
## GarageCond* 65 3 1.00 0.00 1.00 1.00 0.00 1.00
## PavedDrive* 66 3 1.00 0.00 1.00 1.00 0.00 1.00
## WoodDeckSF 67 3 121.67 177.22 40.00 121.67 59.30 0.00
## OpenPorchSF 68 3 14.00 15.10 12.00 14.00 17.79 0.00
## EnclosedPorch 69 3 0.00 0.00 0.00 0.00 0.00 0.00
## X3SsnPorch 70 3 106.67 184.75 0.00 106.67 0.00 0.00
## ScreenPorch 71 3 0.00 0.00 0.00 0.00 0.00 0.00
## PoolArea 72 3 0.00 0.00 0.00 0.00 0.00 0.00
## PoolQC* 73 0 NaN NA NA NaN NA Inf
## Fence* 74 2 1.50 0.71 1.50 1.50 0.74 1.00
## MiscFeature* 75 1 1.00 NA 1.00 1.00 0.00 1.00
## MiscVal 76 3 233.33 404.15 0.00 233.33 0.00 0.00
## MoSold 77 3 6.67 3.06 6.00 6.67 2.97 4.00
## YrSold 78 3 2008.33 2.08 2009.00 2008.33 1.48 2006.00
## SaleType* 79 3 1.00 0.00 1.00 1.00 0.00 1.00
## SaleCondition* 80 3 1.00 0.00 1.00 1.00 0.00 1.00
## SalePrice 81 3 185666.67 56695.09 164000.00 185666.67 31134.60 143000.00
## log_SalePrice 82 3 12.10 0.29 12.01 12.10 0.20 11.87
## max range skew kurtosis se
## Id 1212.00 1206.00 -0.38 -2.33 396.93
## MSSubClass 60.00 10.00 0.38 -2.33 3.33
## MSZoning* 1.00 0.00 NaN NaN 0.00
## LotFrontage 152.00 67.00 0.00 -2.75 33.50
## LotArea 14115.00 2945.00 0.21 -2.33 866.88
## Street* 1.00 0.00 NaN NaN 0.00
## Alley* -Inf -Inf NA NA NA
## LotShape* 2.00 1.00 0.38 -2.33 0.33
## LandContour* 2.00 1.00 -0.38 -2.33 0.33
## Utilities* 1.00 0.00 NaN NaN 0.00
## LotConfig* 2.00 1.00 -0.38 -2.33 0.33
## LandSlope* 2.00 1.00 0.38 -2.33 0.33
## Neighborhood* 3.00 2.00 0.00 -2.33 0.58
## Condition1* 1.00 0.00 NaN NaN 0.00
## Condition2* 1.00 0.00 NaN NaN 0.00
## BldgType* 1.00 0.00 NaN NaN 0.00
## HouseStyle* 2.00 1.00 0.38 -2.33 0.33
## OverallQual 8.00 3.00 -0.21 -2.33 0.88
## OverallCond 7.00 2.00 0.38 -2.33 0.67
## YearBuilt 1993.00 5.00 0.13 -2.33 1.45
## YearRemodAdd 2005.00 14.00 0.26 -2.33 4.16
## RoofStyle* 1.00 0.00 NaN NaN 0.00
## RoofMatl* 1.00 0.00 NaN NaN 0.00
## Exterior1st* 3.00 2.00 0.00 -2.33 0.58
## Exterior2nd* 3.00 2.00 0.00 -2.33 0.58
## MasVnrType* 1.00 0.00 NaN NaN 0.00
## MasVnrArea 0.00 0.00 NaN NaN 0.00
## ExterQual* 2.00 1.00 -0.38 -2.33 0.33
## ExterCond* 1.00 0.00 NaN NaN 0.00
## Foundation* 1.00 0.00 NaN NaN 0.00
## BsmtQual* 1.00 0.00 NaN NaN 0.00
## BsmtCond* 1.00 0.00 NaN NaN 0.00
## BsmtExposure* 2.00 1.00 -0.38 -2.33 0.33
## BsmtFinType1* 2.00 1.00 0.38 -2.33 0.33
## BsmtFinSF1 1216.00 789.00 0.15 -2.33 229.71
## BsmtFinType2* 1.00 0.00 NaN NaN 0.00
## BsmtFinSF2 0.00 0.00 NaN NaN 0.00
## BsmtUnfSF 132.00 132.00 0.02 -2.33 38.11
## TotalBsmtSF 1216.00 657.00 0.18 -2.33 192.10
## Heating* 1.00 0.00 NaN NaN 0.00
## HeatingQC* 2.00 1.00 0.38 -2.33 0.33
## CentralAir* 1.00 0.00 NaN NaN 0.00
## Electrical* 1.00 0.00 NaN NaN 0.00
## X1stFlrSF 1298.00 502.00 -0.09 -2.33 145.33
## X2ndFlrSF 1216.00 650.00 0.35 -2.33 201.34
## LowQualFinSF 0.00 0.00 NaN NaN 0.00
## GrLivArea 2514.00 1152.00 0.20 -2.33 338.28
## BsmtFullBath 1.00 1.00 0.38 -2.33 0.33
## BsmtHalfBath 0.00 0.00 NaN NaN 0.00
## FullBath 2.00 1.00 -0.38 -2.33 0.33
## HalfBath 1.00 1.00 -0.38 -2.33 0.33
## BedroomAbvGr 4.00 3.00 -0.38 -2.33 1.00
## KitchenAbvGr 1.00 0.00 NaN NaN 0.00
## KitchenQual* 1.00 0.00 NaN NaN 0.00
## TotRmsAbvGrd 8.00 3.00 -0.38 -2.33 1.00
## Functional* 1.00 0.00 NaN NaN 0.00
## Fireplaces 0.00 0.00 NaN NaN 0.00
## FireplaceQu* -Inf -Inf NA NA NA
## GarageType* 2.00 1.00 0.38 -2.33 0.33
## GarageYrBlt 1993.00 5.00 0.13 -2.33 1.45
## GarageFinish* 3.00 2.00 0.00 -2.33 0.58
## GarageCars 2.00 0.00 NaN NaN 0.00
## GarageArea 693.00 213.00 0.38 -2.33 69.09
## GarageQual* 1.00 0.00 NaN NaN 0.00
## GarageCond* 1.00 0.00 NaN NaN 0.00
## PavedDrive* 1.00 0.00 NaN NaN 0.00
## WoodDeckSF 325.00 325.00 0.36 -2.33 102.32
## OpenPorchSF 30.00 30.00 0.13 -2.33 8.72
## EnclosedPorch 0.00 0.00 NaN NaN 0.00
## X3SsnPorch 320.00 320.00 0.38 -2.33 106.67
## ScreenPorch 0.00 0.00 NaN NaN 0.00
## PoolArea 0.00 0.00 NaN NaN 0.00
## PoolQC* -Inf -Inf NA NA NA
## Fence* 2.00 1.00 0.00 -2.75 0.50
## MiscFeature* 1.00 0.00 NA NA NA
## MiscVal 700.00 700.00 0.38 -2.33 233.33
## MoSold 10.00 6.00 0.21 -2.33 1.76
## YrSold 2010.00 4.00 -0.29 -2.33 1.20
## SaleType* 1.00 0.00 NaN NaN 0.00
## SaleCondition* 1.00 0.00 NaN NaN 0.00
## SalePrice 250000.00 107000.00 0.33 -2.33 32732.93
## log_SalePrice 12.43 0.56 0.29 -2.33 0.17
PLOT Univariate Descriptive Statistics
Selecting 10 popular variables for describing houses for sale: (1) location of the house (Neighborhood), (2) number of full bathrooms (FullBath), (3) and (4) proximity to various conditions (Condition1, Condition2), (5) kitchen quality (KitchenQual), (6) above-ground living area in square feet (GrLivArea), (7) month sold (MoSold), (8) heating type (Heating), (9) type of sale (SaleType), and (10) the dependent variable, sale price (SalePrice).
The boxplot, drawn with plotly, shows the distribution of the log-transformed sale price (log_SalePrice) for all listed homes, grouped by FullBath. Homes with 3 FullBath have the highest median SalePrice and no variation. Homes with 1 FullBath have the lowest median SalePrice and a lot of variation. Homes with no FullBath seem to be the most symmetrically distributed around their median value.
Variable | Description |
---|---|
FullBath | Full bathrooms above grade |
The dots above and below the FullBath groups indicate those data points are outliers (i.e., extremely high or low), as seen for the homes with one FullBath and two FullBath.
#library(ggplot2)
#library(plotly)
fig <- plot_ly(data = log_train, y = ~log_SalePrice, x = ~FullBath, color = ~FullBath, type = "box", showlegend = FALSE)
fig
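The outlier dots follow the usual boxplot convention. As a minimal sketch, assuming the common 1.5 × IQR rule used by base R (plotly's quartile method may differ slightly), the flagged points can be reproduced as follows. The vector `prices` is made-up illustration data, not values from the training set.

```r
# Hypothetical toy data: ten typical values and one extreme value
prices <- c(1:10, 50)

# Quartiles and interquartile range
q <- quantile(prices, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]

# Fences: points beyond 1.5 * IQR from the quartiles are drawn as dots
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

outliers <- prices[prices < lower_fence | prices > upper_fence]
print(outliers)

# Base R applies the same kind of rule inside boxplot.stats()
print(boxplot.stats(prices)$out)
```

Only the extreme value falls outside the fences here, which is the kind of point drawn as a separate dot in the plot.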
The notched boxplot lets you evaluate confidence intervals (95% by default) for the median of each box. The notch shows the level of uncertainty around the median.
In the plot, the GrLivArea distribution is grouped by Heating type. The OthW group has the highest median value and no outliers. The GasA group has a significant number of outliers. The GasW group seems to be symmetrically distributed around its median value.
Variable | Description |
---|---|
GrLivArea | Above grade (ground) living area square feet |
fig2 <- plot_ly(data = log_train, x = ~Heating, y = ~GrLivArea, type = "box", color = ~Heating, notched = TRUE, showlegend = FALSE)
fig2
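The notch limits can be sketched by hand. The example below assumes base R's convention of median ± 1.58 × IQR / sqrt(n), which gives roughly a 95% interval for comparing medians; whether plotly uses exactly the same constant is not confirmed here. The vector `area` is simulated and only stands in for one Heating group.

```r
# Hypothetical sample standing in for the GrLivArea values of one group
set.seed(1)
area <- rnorm(200, mean = 1500, sd = 400)

n <- length(area)
# boxplot.stats() uses the hinges from fivenum() as quartiles
hinges <- fivenum(area)[c(2, 4)]
half_width <- 1.58 * diff(hinges) / sqrt(n)

# Notch limits around the median
notch <- median(area) + c(-1, 1) * half_width
print(notch)

# boxplot.stats() returns the same limits in $conf
print(boxplot.stats(area)$conf)
```

When the notches of two boxes do not overlap, that is usually read as evidence that the two medians differ.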
Plots on Qualitative variables
The barplot indicates the frequency (counts) of KitchenQual for Home Prices data set. The KitchenQual variable has four categories: TA, Gd, Ex, and Fa as indicated in the Home Sales data set arranged in descending order.
KitchenQual | Kitchen quality |
---|---|
Ex | Excellent |
Gd | Good |
TA | Typical/Average |
Fa | Fair |
Po | Poor |
Boxplot shows the House SalePrice versus KitchenQual rating.
#library(scales)
ggplot(data=log_train[!is.na(log_train$KitchenQual),], aes(x=factor(KitchenQual), y=SalePrice))+
geom_boxplot(col='blue', fill = "peru") + labs(x='Kitchen Quality') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
The plot_ly function plots the KitchenQual categorical variable and indicates that the Excellent (Ex) rating is an important criterion for house sales.
The bar chart indicates that the North Ames (NAmes) neighborhood has the most home sales under Normal (Norm) conditions for Condition1 and Condition2.
Condition1 / Condition2 | Proximity to various conditions (Condition2 applies if more than one is present) |
---|---|
Artery | Adjacent to arterial street |
Feedr | Adjacent to feeder street |
Norm | Normal |
RRNn | Within 200’ of North-South Railroad |
RRAn | Adjacent to North-South Railroad |
PosN | Near positive off-site feature (park, greenbelt, etc.) |
PosA | Adjacent to positive off-site feature |
RRNe | Within 200’ of East-West Railroad |
RRAe | Adjacent to East-West Railroad |
library(plotly)
plot_bar2 <- plot_ly(log_train, x = ~ Neighborhood, y = ~ Condition1, type = 'bar', name = 'Condition 1')
plot_bar2 <- plot_bar2 %>% add_trace(y = ~ Condition2, name = 'Condition2')
plot_bar2 <- plot_bar2 %>% layout(yaxis = list(title = 'Count'), barmode = 'group')
plot_bar2
Plot (Lollipop chart) with Categorical and Numerical Variable
Visualize the relationship between a categorical (SaleType) and a numerical (MoSold) variable. SaleType was converted to a factor variable and tabulated to show the group frequencies. In the plot of SaleType versus MoSold, the New and WD groups seem to have sold continuously each month, and the Con group has the fewest sales across the MoSold range.
house_saletype <- as.factor(log_train$SaleType)
table(house_saletype)
## house_saletype
## COD Con ConLD ConLI ConLw CWD New Oth WD
## 43 2 9 5 5 4 122 3 1267
# install.packages("ggplot2")
#library(ggplot2)
ggplot(log_train, aes(x = SaleType, y = MoSold)) +
geom_segment(aes(x = SaleType, xend = SaleType, y = 0, yend = MoSold),
color = "tomato", lwd = 1) +
geom_point(size = 4, pch = 21, bg = 4, col = 1) +
geom_text(aes(label = SaleType), color = "grey0", size = 3) +
scale_x_discrete(labels = paste("Group", 1:10)) +
theme(axis.text.x = element_text(angle = 90,
vjust = 0.5, hjust = 1))
Scatterplot matrix on two independent variables and dependent variable.
The scatterplots will indicate whether there is a potential link between two quantitative variables. The scatterplot correlation between the independent variables (BedroomAbvGr and GarageCars) and the dependent variable (log_SalePrice) shows no association between variables.
# car package
scatterplotMatrix(~log_SalePrice + BedroomAbvGr + GarageCars, data = log_train,
diagonal = FALSE, # Remove kernel density estimates
regLine = list(col = "green", # Linear regression line color
lwd = 3), # Linear regression line width
smooth = list(col.smooth = "red", # Non-parametric mean color
col.spread = "blue")) # Non-parametric variance color
Correlation matrix for any three quantitative variables in the dataset.
Variable Identification / Restructure data set
Compute the correlation matrix on the df_train.num data set to view the correlations among the numerical variables using the Pearson method.
Indexing Numeric Variables
numericVars <- which(sapply(df_train, is.numeric)) #index vector numeric variables
numericVarNames <- names(numericVars) #saving names vector for use later on
cat('There are', length(numericVars), 'numeric variables')
## There are 38 numeric variables
#library(corrr)
house_numVar <- df_train[, numericVars]
cor_numVar <- cor(house_numVar, use="pairwise.complete.obs") #correlations of all numeric variables
#sort on decreasing correlations with SalePrice
cor_sorted <- as.matrix(sort(cor_numVar[,'SalePrice'], decreasing = TRUE))
#select only high correlations
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
cor_numVar <- cor_numVar[CorHigh, CorHigh]
corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt", tl.srt=45)
There are 10 numeric (independent) variables whose correlation with the dependent variable, SalePrice, is greater than 0.5. The three independent variables most strongly correlated with SalePrice are OverallQual, GrLivArea, and GarageCars.
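The selection of highly correlated variables can be tried on any numeric data frame. The following sketch uses R's built-in `mtcars` data as a stand-in, with `mpg` playing the role of SalePrice; the 0.5 threshold mirrors the one used above, and the object names are illustrative only.

```r
# Built-in example data; mpg plays the role of the response (like SalePrice)
cor_all <- cor(mtcars, use = "pairwise.complete.obs")

# Correlations with the response, sorted in decreasing order
cor_with_mpg <- sort(cor_all[, "mpg"], decreasing = TRUE)

# Keep variables whose absolute correlation with the response exceeds 0.5
high <- names(cor_with_mpg)[abs(cor_with_mpg) > 0.5]
cor_high <- cor_all[high, high]

print(round(cor_high[, "mpg"], 2))
```

Filtering on the absolute value keeps strong negative correlations as well as strong positive ones.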
#set correlation coefficient for the top three variables
(corr_top3 <- data.frame(round(as.matrix(cor_numVar[1:4, 1:4]),2)))
## SalePrice OverallQual GrLivArea GarageCars
## SalePrice 1.00 0.79 0.71 0.64
## OverallQual 0.79 1.00 0.59 0.60
## GrLivArea 0.71 0.59 1.00 0.47
## GarageCars 0.64 0.60 0.47 1.00
#correlation matrix of the top 3 correlated variables with SalePrice
corr3_matrix <- cor(corr_top3)
#graph of correlation matrix - top 3 correlated variables
corrplot(corr3_matrix, method="pie", tl.srt = 45)
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
- Null hypothesis, \(H_0\): The correlation between each pairwise set of variables is zero.
- Alternative hypothesis, \(H_a\): The correlation between each pairwise set of variables is not equal to zero.
- Significance level: \(\alpha = 0.05\)
The Pearson method measures the linear dependence between two variables (x and y). For each pair, cor.test reports the correlation coefficient, the hypothesis test, and the 80% confidence interval.
#correlation test between SalePrice ~ OverallQual
cor.test(df_train$SalePrice, df_train$OverallQual, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: df_train$SalePrice and df_train$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
#correlation test between SalePrice ~ GrLivArea
cor.test(df_train$SalePrice, df_train$GrLivArea, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: df_train$SalePrice and df_train$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
#correlation test between SalePrice ~ GarageCars
cor.test(df_train$SalePrice, df_train$GarageCars, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: df_train$SalePrice and df_train$GarageCars
## t = 31.839, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6201771 0.6597899
## sample estimates:
## cor
## 0.6404092
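To make the reported numbers easier to interpret, the sketch below recomputes the test statistic and the 80% interval for SalePrice and OverallQual from the printed correlation alone. It assumes the standard formulas for Pearson's test (a t statistic with n - 2 degrees of freedom and a Fisher z-transform interval); the inputs are copied from the output above.

```r
# Values taken from the first cor.test() output above
r <- 0.7909816      # sample correlation
n <- 1460           # df + 2 = 1458 + 2
conf_level <- 0.80

# Test statistic for H0: rho = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Fisher z-transform confidence interval
z <- atanh(r)
se <- 1 / sqrt(n - 3)
z_crit <- qnorm(1 - (1 - conf_level) / 2)
ci <- tanh(z + c(-1, 1) * z_crit * se)

print(round(t_stat, 3))
print(round(ci, 4))
```

The results should agree with the t value and the interval printed by cor.test up to rounding of the input correlation.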
Discuss the meaning of your analysis.
The Pearson correlation tests show that SalePrice is positively and significantly correlated with each of OverallQual, GrLivArea, and GarageCars: every p-value is below 2.2e-16 and none of the 80% confidence intervals contains zero. We therefore reject the null hypothesis in favor of the alternative that each pairwise correlation is not equal to zero.
Would you be worried about familywise error? Why or why not?
Familywise error is the risk of producing at least one false positive result when conducting multiple hypothesis tests at once. Three pairwise correlation tests were conducted above, each at an alpha level of \(\alpha = .05\). The family-wise error rate would be calculated as:
fw_error <- 1 - (1-.05)^3
paste0("Family-wise error rate of ", round(fw_error, 3), " will increase the probability of an error on at least one of the hypothesis tests.")
## [1] "Family-wise error rate of 0.143 will increase the probability of an error on at least one of the hypothesis tests."
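The same calculation generalizes to any number of tests. The sketch below assumes independent tests and uses the Bonferroni correction as one common remedy; the helper `fwer` and the p-values in `p_raw` are hypothetical and serve only as illustration.

```r
# Family-wise error rate for m independent tests at level alpha
fwer <- function(m, alpha = 0.05) 1 - (1 - alpha)^m

print(round(fwer(3), 3))    # three tests, as in the calculation above

# Bonferroni: test each hypothesis at alpha / m to keep the family-wise rate <= alpha
alpha_bonf <- 0.05 / 3
print(fwer(3, alpha_bonf))

# Hypothetical p-values for illustration only
p_raw <- c(0.001, 0.020, 0.040)
p_adj <- p.adjust(p_raw, method = "bonferroni")
print(p_adj)
```

With p-values as small as those reported by the three correlation tests, a Bonferroni adjustment would not change the conclusions.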
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
#correlation matrix of the top 3 correlated variables with SalePrice
(corr3_matrix)
## SalePrice OverallQual GrLivArea GarageCars
## SalePrice 1.0000000 0.4823030 0.1171696 -0.3736374
## OverallQual 0.4823030 1.0000000 -0.3092674 -0.2739513
## GrLivArea 0.1171696 -0.3092674 1.0000000 -0.8293202
## GarageCars -0.3736374 -0.2739513 -0.8293202 1.0000000
#inverse correlation matrix
(invcorr <- matrix_inverse(corr3_matrix))
## SalePrice OverallQual GrLivArea GarageCars
## SalePrice 1.418920414 -0.5642499 0.004725651 0.3961108
## OverallQual -0.564249916 0.9477893 -0.214494076 -0.3810742
## GrLivArea 0.004725651 -0.2144941 0.329767644 -0.2062645
## GarageCars 0.396110785 -0.3810742 -0.206264507 0.4591154
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
#multiply correlation matrix by precision matrix
(corr_inv <- corr3_matrix %*% invcorr)
## SalePrice OverallQual GrLivArea GarageCars
## SalePrice 0.99933292 0.01012303 0.0169814 0.01660699
## OverallQual 0.01012303 0.84638169 -0.2576948 -0.25201313
## GrLivArea 0.01698140 -0.25769483 0.5677167 -0.42275224
## GarageCars 0.01660699 -0.25201313 -0.4227522 0.58656868
#multiply precision matrix by correlation matrix
(inv_corr <- invcorr %*% corr3_matrix)
## SalePrice OverallQual GrLivArea GarageCars
## SalePrice 0.99933292 0.01012303 0.0169814 0.01660699
## OverallQual 0.01012303 0.84638169 -0.2576948 -0.25201313
## GrLivArea 0.01698140 -0.25769483 0.5677167 -0.42275224
## GarageCars 0.01660699 -0.25201313 -0.4227522 0.58656868
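For a nonsingular correlation matrix, both products return the identity matrix. As a point of comparison, here is a minimal sketch that applies base R's `solve()` to the rounded correlations from the corr_top3 table. This is an assumption made for illustration: the analysis above inverts corr3_matrix with matrix_inverse instead.

```r
# Rounded correlations copied from the corr_top3 table above
vars <- c("SalePrice", "OverallQual", "GrLivArea", "GarageCars")
R <- matrix(c(1.00, 0.79, 0.71, 0.64,
              0.79, 1.00, 0.59, 0.60,
              0.71, 0.59, 1.00, 0.47,
              0.64, 0.60, 0.47, 1.00),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Precision matrix: the ordinary inverse of a nonsingular correlation matrix
P <- solve(R)

# Diagonal entries are the variance inflation factors
vif <- diag(P)
print(round(vif, 2))

# Both products should equal the identity up to rounding error
print(round(R %*% P, 10))
print(round(P %*% R, 10))
```

Each variance inflation factor is at least 1, and larger values indicate that a variable is more strongly explained by the others.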
library(Matrix)
matrix_exp <- expand(lu(corr3_matrix))
for( i in 1:nrow(corr3_matrix) ){
  for( j in 1:ncol(corr3_matrix) ){
    # This doesn't do anything, but here you can think about how to check
    # where in the matrix you are by checking the relative values of i and j
    corr3_matrix[i,j] = corr3_matrix[i,j]
  }
}
Lower decomposition
lu_lower <- matrix_exp$L
matrix_exp$L
## 4 x 4 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3] [,4]
## [1,] 1.0000000 . . .
## [2,] 0.4823030 1.0000000 . .
## [3,] -0.3736374 -0.1221617 1.0000000 .
## [4,] 0.1171696 -0.4766567 -0.9779518 1.0000000
Upper decomposition
lu_upper <- matrix_exp$U
matrix_exp$U
## 4 x 4 Matrix of class "dtrMatrix"
## [,1] [,2] [,3] [,4]
## [1,] 1.000000e+00 4.823030e-01 1.171696e-01 -3.736374e-01
## [2,] . 7.673839e-01 -3.657786e-01 -9.374488e-02
## [3,] . . -8.302254e-01 8.489431e-01
## [4,] . . . 3.330669e-16
print(matrix_exp$P)
## 4 x 4 sparse Matrix of class "pMatrix"
##
## [1,] | . . .
## [2,] . | . .
## [3,] . . . |
## [4,] . . | .
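The three factors fit together as A = P L U, where P is the permutation matrix that records the row swaps. A short sketch of that check follows, again using the rounded corr_top3 correlations as an assumed input rather than corr3_matrix itself.

```r
library(Matrix)

# Rounded correlations copied from the corr_top3 table above
A <- matrix(c(1.00, 0.79, 0.71, 0.64,
              0.79, 1.00, 0.59, 0.60,
              0.71, 0.59, 1.00, 0.47,
              0.64, 0.60, 0.47, 1.00),
            nrow = 4, byrow = TRUE)

# expand() returns a list with the components L, U and P
dec <- expand(lu(A))

# Rebuild the matrix from its factors: A = P %*% L %*% U
rebuilt <- as.matrix(dec$P %*% dec$L %*% dec$U)
print(max(abs(rebuilt - A)))
```

The largest absolute difference should be at the level of floating-point rounding error.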
Fit a closed form distribution to data. Selecting a variable in the Kaggle.com training dataset that is skewed to the right, and shifting it so that the minimum value is absolutely above zero if necessary.
Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/Rdevel/library/MASS/html/fitdistr.html).
Compute the skewness of train.num, the subset of the training data frame that contains only numerical variables. In the table, MiscVal has the highest value (positive, “right” skew) and YearBuilt has the lowest computed value (negative, “left” skew). LotFrontage, MasVnrArea, and GarageYrBlt return NA because they contain missing values.
Separating numerical and character columns for future statistical testing:
#subset numeric columns with dplyr
train.num <- data.frame(select_if(log_train, is.numeric))
train.num[] <- lapply(train.num, function(x) as.numeric(as.character(x)))
reactable(train.num, wrap = FALSE)
#subset character columns with dplyr
train.char <- log_train[,!names(log_train) %in% colnames(train.num)]
reactable(train.char, wrap = FALSE)
#library(moments)
train_skew <- data.frame(skewness(train.num)) #calculate skewness
train_skew <- cbind(Variable = rownames(train_skew), train_skew)
rownames(train_skew) <- 1:nrow(train_skew)
train_skew[order(train_skew$skewness.train.num.), ]
## Variable skewness.train.num.
## 7 YearBuilt -0.61283072
## 8 YearRemodAdd -0.50304450
## 27 GarageCars -0.34219690
## 1 Id 0.00000000
## 20 FullBath 0.03652398
## 37 YrSold 0.09616958
## 39 log_SalePrice 0.12121037
## 28 GarageArea 0.17979594
## 22 BedroomAbvGr 0.21157244
## 36 MoSold 0.21183506
## 5 OverallQual 0.21672098
## 18 BsmtFullBath 0.59545404
## 25 Fireplaces 0.64889763
## 21 HalfBath 0.67520283
## 24 TotRmsAbvGrd 0.67564577
## 6 OverallCond 0.69235521
## 15 X2ndFlrSF 0.81219427
## 12 BsmtUnfSF 0.91932270
## 17 GrLivArea 1.36515595
## 14 X1stFlrSF 1.37534174
## 2 MSSubClass 1.40621011
## 13 TotalBsmtSF 1.52268809
## 29 WoodDeckSF 1.53979170
## 10 BsmtFinSF1 1.68377090
## 38 SalePrice 1.88094075
## 30 OpenPorchSF 2.36191193
## 31 EnclosedPorch 3.08669647
## 19 BsmtHalfBath 4.09918567
## 33 ScreenPorch 4.11797738
## 11 BsmtFinSF2 4.25088802
## 23 KitchenAbvGr 4.48378409
## 16 LowQualFinSF 9.00208042
## 32 X3SsnPorch 10.29375236
## 4 LotArea 12.19514213
## 34 PoolArea 14.81313466
## 35 MiscVal 24.45163962
## 3 LotFrontage NA
## 9 MasVnrArea NA
## 26 GarageYrBlt NA
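The sign of the skewness and the NA rows in the table can be illustrated with a small hand-written version of the statistic. The sketch assumes the moment-based definition (third central moment divided by the variance raised to 3/2), which appears to be the one the moments package uses; the helper `skew` and the vectors are hypothetical.

```r
# Moment-based sample skewness: third central moment over variance^(3/2)
skew <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

symmetric <- c(1, 2, 3, 4, 5)
right_tail <- c(1, 1, 1, 1, 10)   # hypothetical values with one large observation
with_na <- c(1, 1, NA, 1, 10)

print(skew(symmetric))               # symmetric data: no skew
print(skew(right_tail))              # long right tail: positive skew
print(skew(with_na))                 # a missing value propagates through mean()
print(skew(with_na, na.rm = TRUE))   # computed on the non-missing values only
```

Dropping missing values before computing the statistic is one way to obtain values for columns that otherwise return NA.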
The histogram of the skewness values shows a right-skewed distribution: most of the values cluster on the left, and a long tail extends to the right (positive side) of the peak. The mode is the highest point of the histogram, whereas the median and mean fall to the right of it.
hs <- hist(train_skew$skewness.train.num., col=rainbow(10),
main = "Skewness of Training Variables", xlab = "Training data Distribution Count")
Plot the density distribution of selected variable and compare the observed distribution to what we would expect if it were perfectly normal (dashed red line).
Right Skewed
#library(ggpubr)
# Distribution of MiscVal variable (right skewed)
ggdensity(train.num, x = "MiscVal", fill = "blue", title="MiscVal") +
scale_x_continuous(limits = c(1000, 1600)) +
stat_overlay_normal_density(color = "red", linetype = "dashed", lwd = 2)
## Warning: Removed 1455 rows containing non-finite values (stat_density).
## Warning: Removed 1455 rows containing non-finite values
## (stat_overlay_normal_density).
## Warning: Removed 13 row(s) containing missing values (geom_path).
The summary function shows the data distribution for the variable.
# Select a variable in the Kaggle.com training dataset that is skewed to the right (MiscVal)
# Distribution of MiscVal variable (right skewed)
skew_MiscVal <- train.num$MiscVal
summary(skew_MiscVal)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 43.49 0.00 15500.00
Left Skewed
# Distribution of YearBuilt variable (left skewed)
ggdensity(train.num, x = "YearBuilt", fill = "blue", title = "YearBuilt") +
scale_x_continuous(limits = c(1800, 2100)) +
stat_overlay_normal_density(color = "red", linetype = "dashed", lwd=2)
The summary function shows the data distribution for the variable.
# Distribution of YearBuilt variable (left skewed)
skew_YearBuilt <- train.num$YearBuilt
summary(skew_YearBuilt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1872 1954 1973 1971 2000 2010
The summary function shows the distribution for the variable MiscVal shifted above zero.
# Distribution of MiscVal variable (right skewed) with the min value shifted above zero
skew_MiscVal2 <- skew_MiscVal+1
summary(skew_MiscVal2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 1.00 44.49 1.00 15501.00
Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality.
Run the fitdistr function from the MASS package to fit an exponential probability density function (PDF).
# Load the MASS package and run fitdistr to fit an exponential probability density function (PDF)
#library(MASS)
set.seed(1234)
exp.pdf <- fitdistr(skew_MiscVal2, densfun="exponential")
The optimal value of lambda for the distribution is the estimate component of the fitdistr result. The value is output below.
# Find the optimal value of lambda for this distribution
exp.pdf$estimate
## rate
## 0.02247745
lambda <- exp.pdf$estimate
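For the exponential distribution, the maximum-likelihood rate has a closed form: it is the reciprocal of the sample mean. This is consistent with the output above, since the shifted variable has a mean of 44.49 and 1 / 44.49 is about 0.0225. The sketch below checks the identity on simulated data; the vector `x` is hypothetical and is not the MiscVal variable.

```r
library(MASS)

set.seed(1234)
# Hypothetical right-skewed, strictly positive data
x <- rexp(500, rate = 0.02)

fit <- fitdistr(x, densfun = "exponential")
lambda_hat <- unname(fit$estimate)

# For the exponential distribution the maximum-likelihood rate is 1 / mean
print(lambda_hat)
print(1 / mean(x))
```

Because the estimate depends only on the mean, a variable dominated by zeros with a few large values can fit an exponential shape poorly even though the fit itself succeeds.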
Generate 1000 samples using the lambda value.
# then take 1000 samples from this exponential distribution using this value
samples <- rexp(1000, lambda)
(samples[1:5])
## [1] 111.3008414 10.9780661 0.2928249 77.5331024 17.2253819
Compared with the original variable, the histogram of the 1000 samples below spans a smaller range of values along the x-axis and has less concentrated counts along the y-axis. The new histogram is still right skewed, but not to the same degree. From visual inspection, the second bucket of the histogram below is much closer to half of the first bucket than in the initial histogram. Overall the data is somewhat more evenly distributed, though neither uniform nor normal, and the range of values has decreased by almost half.
# Plot a histogram and compare it with a histogram of your original variable.
hist(samples)
Finally, provide the empirical 5th percentile and 95th percentile of the data.
# Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function
per5 <- qexp(.05, rate=lambda, lower.tail=TRUE)
per95 <- qexp(.95, rate=lambda, lower.tail=TRUE)
Given the lambda of the exponential PDF, the 5th percentile is 2.2819895 and the 95th percentile is 133.2772562.
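These percentiles follow directly from inverting the exponential CDF, F(x) = 1 - exp(-lambda x), which gives Q(p) = -log(1 - p) / lambda. A minimal sketch using the rate printed above; the helper name `q_exp` is illustrative.

```r
lambda <- 0.02247745   # rate estimate printed above

# Inverse CDF of the exponential distribution: Q(p) = -log(1 - p) / lambda
q_exp <- function(p, rate) -log(1 - p) / rate

per5 <- q_exp(0.05, lambda)
per95 <- q_exp(0.95, lambda)
print(c(per5, per95))

# qexp() implements the same formula
print(qexp(c(0.05, 0.95), rate = lambda))
```

Both approaches should reproduce the 5th and 95th percentiles reported above, up to rounding of the rate.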
Required Libraries
library(ggplot2)
library(scales)
library(ggrepel)
Combine data
#create new dataframe
df.train <- df_train
df.test <- df_test
#Getting rid of the IDs but keeping the test IDs in a vector. These are needed to compose the submission file
test.labels <- df.test$Id
df.test$Id <- NULL
df.train$Id <- NULL
The test dataset has no SalePrice variable, so we create it (filled with NA) and then combine the two sets.
df.test$SalePrice <- rep(NA, 1459)
house <- rbind(df.train, df.test)
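The combine step can be sketched on two small data frames. The frames `toy_train` and `toy_test` and their values are hypothetical stand-ins for the real train and test sets; the point is that rbind() needs the same columns in both frames.

```r
# Hypothetical toy frames standing in for the train and test sets
toy_train <- data.frame(Id = 1:3, GrLivArea = c(1710, 1262, 1786),
                        SalePrice = c(208500, 181500, 223500))
toy_test <- data.frame(Id = 4:5, GrLivArea = c(896, 1329))

# Keep the test IDs, then drop Id from both frames
toy_labels <- toy_test$Id
toy_train$Id <- NULL
toy_test$Id <- NULL

# rbind() needs identical column names, so add the missing response as NA
toy_test$SalePrice <- NA_real_
toy_all <- rbind(toy_train, toy_test)

# Rows with a missing SalePrice are the ones to predict later
print(dim(toy_all))
print(which(is.na(toy_all$SalePrice)))
```

Assigning a single NA lets R recycle it to the length of the frame, which avoids hard-coding the number of test rows.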
Explore the combined data: preview the first rows and check the dimensions before reviewing the numeric summary (minimum, median, mean, and maximum values) of the independent variables and the dependent variable.
#data exploration
head(house)
## MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
dim(house)
## [1] 2919 80
View the object summaries depending on class (numeric): minimum value, maximum value, mean value, 1st quartile (25th percentile), and 3rd quartile (75th percentile)
summary(house)
## MSSubClass MSZoning LotFrontage LotArea
## Min. : 20.00 Length:2919 Min. : 21.00 Min. : 1300
## 1st Qu.: 20.00 Class :character 1st Qu.: 59.00 1st Qu.: 7478
## Median : 50.00 Mode :character Median : 68.00 Median : 9453
## Mean : 57.14 Mean : 69.31 Mean : 10168
## 3rd Qu.: 70.00 3rd Qu.: 80.00 3rd Qu.: 11570
## Max. :190.00 Max. :313.00 Max. :215245
## NA's :486
## Street Alley LotShape LandContour
## Length:2919 Length:2919 Length:2919 Length:2919
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Utilities LotConfig LandSlope Neighborhood
## Length:2919 Length:2919 Length:2919 Length:2919
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Condition1 Condition2 BldgType HouseStyle
## Length:2919 Length:2919 Length:2919 Length:2919
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## OverallQual OverallCond YearBuilt YearRemodAdd
## Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1965
## Median : 6.000 Median :5.000 Median :1973 Median :1993
## Mean : 6.089 Mean :5.565 Mean :1971 Mean :1984
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2001 3rd Qu.:2004
## Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
##
## RoofStyle RoofMatl Exterior1st Exterior2nd
## Length:2919 Length:2919 Length:2919 Length:2919
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## MasVnrType MasVnrArea ExterQual ExterCond
## Length:2919 Min. : 0.0 Length:2919 Length:2919
## Class :character 1st Qu.: 0.0 Class :character Class :character
## Mode :character Median : 0.0 Mode :character Mode :character
## Mean : 102.2
## 3rd Qu.: 164.0
## Max. :1600.0
## NA's :23
## Foundation BsmtQual BsmtCond BsmtExposure
## Length:2919 Length:2919 Length:2919 Length:2919
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## Length:2919 Min. : 0.0 Length:2919 Min. : 0.00
## Class :character 1st Qu.: 0.0 Class :character 1st Qu.: 0.00
## Mode :character Median : 368.5 Mode :character Median : 0.00
## Mean : 441.4 Mean : 49.58
## 3rd Qu.: 733.0 3rd Qu.: 0.00
## Max. :5644.0 Max. :1526.00
## NA's :1 NA's :1
## BsmtUnfSF TotalBsmtSF Heating HeatingQC
## Min. : 0.0 Min. : 0.0 Length:2919 Length:2919
## 1st Qu.: 220.0 1st Qu.: 793.0 Class :character Class :character
## Median : 467.0 Median : 989.5 Mode :character Mode :character
## Mean : 560.8 Mean :1051.8
## 3rd Qu.: 805.5 3rd Qu.:1302.0
## Max. :2336.0 Max. :6110.0
## NA's :1 NA's :1
## CentralAir Electrical X1stFlrSF X2ndFlrSF
## Length:2919 Length:2919 Min. : 334 Min. : 0.0
## Class :character Class :character 1st Qu.: 876 1st Qu.: 0.0
## Mode :character Mode :character Median :1082 Median : 0.0
## Mean :1160 Mean : 336.5
## 3rd Qu.:1388 3rd Qu.: 704.0
## Max. :5095 Max. :2065.0
##
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## Min. : 0.000 Min. : 334 Min. :0.0000 Min. :0.00000
## 1st Qu.: 0.000 1st Qu.:1126 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 0.000 Median :1444 Median :0.0000 Median :0.00000
## Mean : 4.694 Mean :1501 Mean :0.4299 Mean :0.06136
## 3rd Qu.: 0.000 3rd Qu.:1744 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1064.000 Max. :5642 Max. :3.0000 Max. :2.00000
## NA's :2 NA's :2
## FullBath HalfBath BedroomAbvGr KitchenAbvGr
## Min. :0.000 Min. :0.0000 Min. :0.00 Min. :0.000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.00 1st Qu.:1.000
## Median :2.000 Median :0.0000 Median :3.00 Median :1.000
## Mean :1.568 Mean :0.3803 Mean :2.86 Mean :1.045
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.00 3rd Qu.:1.000
## Max. :4.000 Max. :2.0000 Max. :8.00 Max. :3.000
##
## KitchenQual TotRmsAbvGrd Functional Fireplaces
## Length:2919 Min. : 2.000 Length:2919 Min. :0.0000
## Class :character 1st Qu.: 5.000 Class :character 1st Qu.:0.0000
## Mode :character Median : 6.000 Mode :character Median :1.0000
## Mean : 6.452 Mean :0.5971
## 3rd Qu.: 7.000 3rd Qu.:1.0000
## Max. :15.000 Max. :4.0000
##
## FireplaceQu GarageType GarageYrBlt GarageFinish
## Length:2919 Length:2919 Min. :1895 Length:2919
## Class :character Class :character 1st Qu.:1960 Class :character
## Mode :character Mode :character Median :1979 Mode :character
## Mean :1978
## 3rd Qu.:2002
## Max. :2207
## NA's :159
## GarageCars GarageArea GarageQual GarageCond
## Min. :0.000 Min. : 0.0 Length:2919 Length:2919
## 1st Qu.:1.000 1st Qu.: 320.0 Class :character Class :character
## Median :2.000 Median : 480.0 Mode :character Mode :character
## Mean :1.767 Mean : 472.9
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :5.000 Max. :1488.0
## NA's :1 NA's :1
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch
## Length:2919 Min. : 0.00 Min. : 0.00 Min. : 0.0
## Class :character 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0
## Mode :character Median : 0.00 Median : 26.00 Median : 0.0
## Mean : 93.71 Mean : 47.49 Mean : 23.1
## 3rd Qu.: 168.00 3rd Qu.: 70.00 3rd Qu.: 0.0
## Max. :1424.00 Max. :742.00 Max. :1012.0
##
## X3SsnPorch ScreenPorch PoolArea PoolQC
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Length:2919
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000 Class :character
## Median : 0.000 Median : 0.00 Median : 0.000 Mode :character
## Mean : 2.602 Mean : 16.06 Mean : 2.252
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :508.000 Max. :576.00 Max. :800.000
##
## Fence MiscFeature MiscVal MoSold
## Length:2919 Length:2919 Min. : 0.00 Min. : 1.000
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 4.000
## Mode :character Mode :character Median : 0.00 Median : 6.000
## Mean : 50.83 Mean : 6.213
## 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :17000.00 Max. :12.000
##
## YrSold SaleType SaleCondition SalePrice
## Min. :2006 Length:2919 Length:2919 Min. : 34900
## 1st Qu.:2007 Class :character Class :character 1st Qu.:129975
## Median :2008 Mode :character Mode :character Median :163000
## Mean :2008 Mean :180921
## 3rd Qu.:2009 3rd Qu.:214000
## Max. :2010 Max. :755000
## NA's :1459
Dependent(Response) Variable
summary(house$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 34900 129975 163000 180921 214000 755000 1459
Predictors (Numeric)
Correlations with SalePrice
numericHouse <- which(sapply(house, is.numeric)) #index vector numeric variables
numericHouseNames <- names(numericHouse) #saving names vector for use later on
cat('There are', length(numericHouse), 'numeric variables')
## There are 37 numeric variables
#library(scales)
ggplot(data=house[!is.na(house$SalePrice),], aes(x=factor(OverallQual), y=SalePrice))+
geom_boxplot(col='blue', fill = "pink") + labs(x='Overall Quality') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
Above Grade (Ground) Living Area (square feet)
library(ggrepel)
ggplot(data=house[!is.na(house$SalePrice),], aes(x=GrLivArea, y=SalePrice))+
geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
geom_text_repel(aes(label = ifelse(house$GrLivArea[!is.na(house$SalePrice)]>4500, rownames(house), '')))
## `geom_smooth()` using formula 'y ~ x'
#outliers
house[c(524, 1299), c('SalePrice', 'GrLivArea', 'OverallQual')]
## SalePrice GrLivArea OverallQual
## 524 184750 4676 10
## 1299 160000 5642 10
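The two labelled points can also be located programmatically instead of by row number. A minimal sketch, assuming a small toy data frame (a few rows copied from the outputs in this report, not the full Ames data) and the same 4500 square-foot cutoff used for the plot labels; the names `toy`, `big`, and `toy_clean` are illustrative only:

```r
# Toy stand-in for the training rows (values taken from outputs in this report)
toy <- data.frame(
  GrLivArea = c(1710, 1262, 4676, 5642, 2198),
  SalePrice = c(208500, 181500, 184750, 160000, 250000)
)

# Flag homes with a very large living area, mirroring the plot-label rule
big <- toy$GrLivArea > 4500
outliers <- toy[big, ]
print(outliers)

# Drop the flagged rows if they should not influence a later fit
toy_clean <- toy[!big, ]
nrow(toy_clean)
```

Whether such rows should be removed is a modelling decision; the sketch only shows how to isolate them.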
Missing data, label encoding, and factorizing variables
Check NULL and NA values in data frame columns
#check dataset for NULL values
is.null(house)
## [1] FALSE
#check dataset NA values
NAcol <- which(colSums(is.na(house)) > 0)
sort(colSums(sapply(house[NAcol], is.na)), decreasing = TRUE)
## PoolQC MiscFeature Alley Fence SalePrice FireplaceQu
## 2909 2814 2721 2348 1459 1420
## LotFrontage GarageYrBlt GarageFinish GarageQual GarageCond GarageType
## 486 159 159 159 159 157
## BsmtCond BsmtExposure BsmtQual BsmtFinType2 BsmtFinType1 MasVnrType
## 82 82 81 80 79 24
## MasVnrArea MSZoning Utilities BsmtFullBath BsmtHalfBath Functional
## 23 4 2 2 2 2
## Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 1 1 1 1 1
## Electrical KitchenQual GarageCars GarageArea SaleType
## 1 1 1 1 1
#replace NA values with zeros in the data frame
nuhouse <- house
nuhouse[is.na(nuhouse)] = 0
any(is.na(nuhouse))
## [1] FALSE
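Because the replacement above is applied to the whole data frame, the missing SalePrice values of the test rows become 0 as well, so a later `is.na(SalePrice)` check can no longer tell train and test rows apart. A minimal sketch of one alternative, assuming the response should stay NA while every other column is zero-filled; the toy frame and the names `predictors` and `filled` are hypothetical:

```r
# Toy stand-in for the combined train/test frame (hypothetical values)
toy <- data.frame(
  LotFrontage = c(65, NA, 68),
  Alley       = c(NA, "Grvl", NA),
  SalePrice   = c(208500, 181500, NA),  # NA marks a test row
  stringsAsFactors = FALSE
)

# Zero-fill every column except the response
predictors <- setdiff(names(toy), "SalePrice")
filled <- toy
filled[predictors][is.na(filled[predictors])] <- 0

# Only SalePrice should still report missing values
colSums(is.na(filled))
```

With this variant, `is.na(filled$SalePrice)` still identifies the test rows after imputation.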
Perform dummy (treatment) coding of the categorical variables for use in regression or ANOVA. This coding consists of creating dichotomous variables in which each level of the categorical variable is contrasted with a specified reference level.
# creating the factor variable
#all_dataftr <- all_data %>% mutate_if(is.integer, as.numeric)
houseFactor <- nuhouse %>% mutate_if(is.character, as.factor)
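To make the idea of a reference level concrete, here is a minimal sketch using base R only. It assumes a toy quality factor (hypothetical values) and relies on R's default treatment contrasts for unordered factors; `qual`, `qual_ta`, and `dummies` are illustrative names:

```r
# Toy factor (hypothetical values); levels sort alphabetically: Ex, Gd, TA
qual <- factor(c("TA", "Gd", "Ex", "TA", "Gd"))

# Default treatment contrasts: the first level ("Ex") is the reference,
# so it gets a row of zeros in the contrast matrix
contrasts(qual)

# Choose a different reference level explicitly
qual_ta <- relevel(qual, ref = "TA")

# The design matrix now has one dummy column per non-reference level
dummies <- model.matrix(~ qual_ta)
colnames(dummies)
```

Each dummy coefficient in a regression is then interpreted relative to the chosen reference level.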
Changing some numeric variables into factors
Missing values (NA) have been filled with zeros, including the missing SalePrice values of the test rows, and all character variables have been converted into factors. Some variables are recorded as numbers but really represent categories, so they are recoded as factors below.
These classes are coded as numbers, but really are categories.
#MSSubClass (integer)
str(houseFactor$MSSubClass)
## int [1:2919] 60 20 60 70 60 50 20 60 50 190 ...
#MSSubClass (factor)
houseFactor$MSSubClass <- as.factor(houseFactor$MSSubClass)
#library(plyr)
#revalue for better readability (plyr package)
houseFactor$MSSubClass <- revalue(houseFactor$MSSubClass, c('20'='1 story 1946+', '30'='1 story 1945-', '40'='1 story unf attic', '45'='1,5 story unf', '50'='1,5 story fin', '60'='2 story 1946+', '70'='2 story 1945-', '75'='2,5 story all ages', '80'='split/multi level', '85'='split foyer', '90'='duplex all style/age', '120'='1 story PUD 1946+', '150'='1,5 story PUD all', '160'='2 story PUD 1946+', '180'='PUD multilevel', '190'='2 family conversion'))
str(houseFactor$MSSubClass)
## Factor w/ 16 levels "1 story 1946+",..: 6 1 6 7 6 5 1 6 5 16 ...
#YrSold (integer)
str(houseFactor$YrSold)
## int [1:2919] 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
#YrSold (factor)
houseFactor$YrSold <- as.factor(houseFactor$YrSold)
str(houseFactor$YrSold)
## Factor w/ 5 levels "2006","2007",..: 3 2 3 1 3 4 2 4 3 3 ...
#MoSold (integer)
str(houseFactor$MoSold)
## int [1:2919] 2 5 9 2 12 10 8 11 4 1 ...
#MoSold (factor)
houseFactor$MoSold <- as.factor(houseFactor$MoSold)
str(houseFactor$MoSold)
## Factor w/ 12 levels "1","2","3","4",..: 2 5 9 2 12 10 8 11 4 1 ...
#YearBuilt (integer)
str(houseFactor$YearBuilt)
## int [1:2919] 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
#YearBuilt (factor)
houseFactor$YearBuilt <- as.factor(houseFactor$YearBuilt)
str(houseFactor$YearBuilt)
## Factor w/ 118 levels "1872","1875",..: 111 84 109 26 108 101 112 81 42 49 ...
#YearRemodAdd (integer)
str(houseFactor$YearRemodAdd)
## int [1:2919] 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
#YearRemodAdd (factor)
houseFactor$YearRemodAdd <- as.factor(houseFactor$YearRemodAdd)
str(houseFactor$YearRemodAdd)
## Factor w/ 61 levels "1950","1951",..: 54 27 53 21 51 46 56 24 1 1 ...
#OverallQual (integer)
str(houseFactor$OverallQual)
## int [1:2919] 7 6 7 7 8 5 8 7 7 5 ...
#OverallQual (factor)
houseFactor$OverallQual <- as.factor(houseFactor$OverallQual)
str(houseFactor$OverallQual)
## Factor w/ 10 levels "1","2","3","4",..: 7 6 7 7 8 5 8 7 7 5 ...
#OverallCond (integer)
str(houseFactor$OverallCond)
## int [1:2919] 5 8 5 5 5 5 5 6 5 6 ...
#OverallCond (factor)
houseFactor$OverallCond <- as.factor(houseFactor$OverallCond)
str(houseFactor$OverallCond)
## Factor w/ 9 levels "1","2","3","4",..: 5 8 5 5 5 5 5 6 5 6 ...
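The column-by-column conversions above can also be written as one step. A minimal sketch, assuming a toy data frame with hypothetical values; `to_factor` is an illustrative name for the list of integer-coded categories:

```r
# Toy frame (hypothetical values) with integer-coded categories
toy <- data.frame(
  MoSold      = c(2L, 5L, 9L),
  YrSold      = c(2008L, 2007L, 2008L),
  OverallQual = c(7L, 6L, 7L),
  LotArea     = c(8450L, 9600L, 11250L)
)

# Columns that are categories in disguise
to_factor <- c("MoSold", "YrSold", "OverallQual")

# Convert all of them in one assignment; other columns are left untouched
toy[to_factor] <- lapply(toy[to_factor], as.factor)

str(toy)
```

Keeping the column names in a single vector makes it easy to see, and to change, which variables are treated as categorical.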
Summary View
head(houseFactor)
## MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 2 story 1946+ RL 65 8450 Pave 0 Reg Lvl
## 2 1 story 1946+ RL 80 9600 Pave 0 Reg Lvl
## 3 2 story 1946+ RL 68 11250 Pave 0 IR1 Lvl
## 4 2 story 1945- RL 60 9550 Pave 0 IR1 Lvl
## 5 2 story 1946+ RL 84 14260 Pave 0 IR1 Lvl
## 6 1,5 story fin RL 85 14115 Pave 0 IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 0 Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 0 Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 272 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 320 0 0 0 MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
The correlation table shows the pairwise Pearson correlations between the numeric variables; most of the coefficients shown are relatively small.
#library(corrr)
#set correlation coefficient (or covariance) to "pearson (default)"
houseCor <- correlate(select_if(houseFactor, is.numeric), diagonal = 1)
houseCor
## # A tibble: 30 x 31
## term LotFrontage LotArea MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 LotFrontage 1 0.135 0.109 0.0692 -0.00468 0.137
## 2 LotArea 0.135 1 0.125 0.194 0.0841 0.0216
## 3 MasVnrArea 0.109 0.125 1 0.302 -0.0146 0.0882
## 4 BsmtFinSF1 0.0692 0.194 0.302 1 -0.0549 -0.477
## 5 BsmtFinSF2 -0.00468 0.0841 -0.0146 -0.0549 1 -0.238
## 6 BsmtUnfSF 0.137 0.0216 0.0882 -0.477 -0.238 1
## 7 TotalBsmtSF 0.206 0.254 0.394 0.537 0.0896 0.413
## 8 X1stFlrSF 0.242 0.332 0.392 0.458 0.0844 0.297
## 9 X2ndFlrSF -0.00466 0.0315 0.119 -0.162 -0.0977 -0.0000324
## 10 LowQualFinSF 0.0190 0.000554 -0.0574 -0.0660 -0.00491 0.0469
## # ... with 20 more rows, and 24 more variables: TotalBsmtSF <dbl>,
## # X1stFlrSF <dbl>, X2ndFlrSF <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>,
## # BsmtFullBath <dbl>, BsmtHalfBath <dbl>, FullBath <dbl>, HalfBath <dbl>,
## # BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, TotRmsAbvGrd <dbl>,
## # Fireplaces <dbl>, GarageYrBlt <dbl>, GarageCars <dbl>, GarageArea <dbl>,
## # WoodDeckSF <dbl>, OpenPorchSF <dbl>, EnclosedPorch <dbl>, X3SsnPorch <dbl>,
## # ScreenPorch <dbl>, PoolArea <dbl>, MiscVal <dbl>, SalePrice <dbl>
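For modelling, the column of most interest in such a table is the one for SalePrice. A minimal sketch of ranking predictors by their correlation with the response, using base R `cor()` rather than the corrr package; the toy frame reuses the six rows printed by `head()` above, so the resulting numbers are illustrative only:

```r
# Six rows copied from the head() output above (not the full data set)
toy <- data.frame(
  GrLivArea   = c(1710, 1262, 1786, 1717, 2198, 1362),
  TotalBsmtSF = c(856, 1262, 920, 756, 1145, 796),
  LotArea     = c(8450, 9600, 11250, 9550, 14260, 14115),
  SalePrice   = c(208500, 181500, 223500, 140000, 250000, 143000)
)

# Pearson correlation matrix, ignoring missing pairs if there were any
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Keep only the SalePrice column and drop its correlation with itself
cor_price <- cor_mat[, "SalePrice"]
cor_price <- cor_price[names(cor_price) != "SalePrice"]

# Strongest absolute correlations first
cor_sorted <- cor_price[order(abs(cor_price), decreasing = TRUE)]
print(round(cor_sorted, 3))
```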
The histogram of the log-transformed sale price shows an almost symmetrical distribution: the mean and median are roughly the same and lie approximately at the center of the data.
hist(log(houseFactor$SalePrice), col = "blue", border = "yellow",
main = "Natural Log Distribution of Sale Price", xlab = "Sale Price")
The pairs() function provides a plot matrix consisting of scatterplots for each pairwise combination of eight numeric variables from the houseFactor data frame.
The pairwise combination plot shows:
* the names of the numeric variables along the diagonal
* a scatterplot of each variable combination in the other cells of the plot matrix (only the lower triangle is drawn, because upper.panel = NULL)
* for example, the left figure in the second row plots LotArea against SalePrice, and so on
#plot a scatterplot matrix for eight selected numeric variables
pairs(~ SalePrice + LotArea + PoolArea + X1stFlrSF + LotFrontage + Fireplaces + KitchenAbvGr + LowQualFinSF, data = houseFactor, gap = 0.5, main = "Pairs matrix", pch = 21,
bg = c("red", "green3", "blue", "yellow"), upper.panel = NULL)
We will check this after we make the model.
PreProcessing predictor variables
#subset numeric columns with dplyr
houseNums <- data.frame(select_if(houseFactor, is.numeric))
houseNums[] <- lapply(houseNums, function(x) as.numeric(as.character(x)))
#subset the factor columns (everything that is not numeric)
houseChars <- houseFactor[,!names(houseFactor) %in% colnames(houseNums)]
houseChars <- houseChars[, names(houseChars) != 'SalePrice']
cat('There are', length(houseNums), 'numeric variables, and', length(houseChars), 'factor variables')
## There are 30 numeric variables, and 50 factor variables
Skewness and normalizing of the numeric predictors
#library(psych)
for(i in 1:ncol(houseNums)){
  if (abs(skew(houseNums[,i]))>0.8){
    houseNums[,i] <- log(houseNums[,i] +1)
  }
}
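The effect of the log(x + 1) step on a skewed variable can be checked directly. A minimal sketch, assuming a hand-written moment-based skewness helper (`skewness_simple` is hypothetical and only approximates what psych::skew reports, since that function offers several closely related estimators) and a made-up right-skewed variable:

```r
# Simple moment-based sample skewness (illustrative helper, not psych::skew)
skewness_simple <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Right-skewed toy variable (hypothetical values): many zeros, one large value
area <- c(0, 0, 0, 40, 60, 120, 300, 1500)

before <- skewness_simple(area)
after  <- skewness_simple(log1p(area))  # log1p(x) computes log(x + 1)

# Compare skewness before and after the transformation
c(before = before, after = after)
```

`log1p()` is used because it handles the zeros that area-type variables often contain, which is the same reason the loop above adds 1 before taking the log.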
Normalizing the data
library(caret)
PreNum <- preProcess(houseNums, method=c("center", "scale"))
print(PreNum)
## Created from 2919 samples and 30 variables
##
## Pre-processing:
## - centered (30)
## - ignored (0)
## - scaled (30)
DFnorm <- predict(PreNum, houseNums)
dim(DFnorm)
## [1] 2919 30
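Centering and scaling a column means subtracting its mean and dividing by its standard deviation. A minimal base R sketch of that arithmetic, assuming this is what the "center" and "scale" methods amount to for a single column; the vector reuses the GrLivArea summary values printed earlier, and `z_manual` and `z_scale` are illustrative names:

```r
# GrLivArea summary values from the summary() output above
x <- c(334, 1126, 1444, 1744, 5642)

# Manual standardization: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# The same result from base R's scale()
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)

# After standardization the column has mean 0 and standard deviation 1
c(mean = mean(z_manual), sd = sd(z_manual))
```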
To one-hot encode the factor variables, I am using the model.matrix() function.
DFdummies <- as.data.frame(model.matrix(~.-1, houseChars))
dim(DFdummies)
## [1] 2919 457
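It helps to see what `model.matrix(~ . - 1, ...)` produces on a small example. A minimal sketch with a toy data frame of two factors (hypothetical values): removing the intercept gives the first factor one column per level, while every later factor still drops its reference level.

```r
# Toy factors (hypothetical values)
toy <- data.frame(
  Street     = factor(c("Pave", "Grvl", "Pave", "Pave")),
  CentralAir = factor(c("Y", "N", "Y", "Y"))
)

# Same call pattern as above: all columns, no intercept
dm <- model.matrix(~ . - 1, toy)

# Street gets a column for each level; CentralAir only gets "Y"
colnames(dm)
dm
```

So the result is a full one-hot encoding only for the first factor; the remaining factors are treatment-coded.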
#check if some values are absent in the test set
ZerocolTest <- which(colSums(DFdummies[(nrow(houseFactor[!is.na(houseFactor$SalePrice),])+1):nrow(houseFactor),])==0)
colnames(DFdummies[ZerocolTest])
## character(0)
Also taking out variables with fewer than 10 ‘ones’ in the train set.
fewOnes <- which(colSums(DFdummies[1:nrow(houseFactor[!is.na(houseFactor$SalePrice),]),])<10)
colnames(DFdummies[fewOnes])
## [1] "MSSubClass1 story unf attic" "MSSubClass1,5 story PUD all"
## [3] "UtilitiesNoSeWa" "Condition1RRNe"
## [5] "Condition1RRNn" "Condition2PosA"
## [7] "Condition2PosN" "Condition2RRAe"
## [9] "Condition2RRAn" "Condition2RRNn"
## [11] "HouseStyle2.5Fin" "YearBuilt1875"
## [13] "YearBuilt1879" "YearBuilt1880"
## [15] "YearBuilt1882" "YearBuilt1885"
## [17] "YearBuilt1890" "YearBuilt1892"
## [19] "YearBuilt1893" "YearBuilt1895"
## [21] "YearBuilt1896" "YearBuilt1898"
## [23] "YearBuilt1901" "YearBuilt1902"
## [25] "YearBuilt1904" "YearBuilt1905"
## [27] "YearBuilt1906" "YearBuilt1907"
## [29] "YearBuilt1908" "YearBuilt1911"
## [31] "YearBuilt1912" "YearBuilt1913"
## [33] "YearBuilt1914" "YearBuilt1917"
## [35] "YearBuilt1919" "YearBuilt1927"
## [37] "YearBuilt1928" "YearBuilt1929"
## [39] "YearBuilt1931" "YearBuilt1932"
## [41] "YearBuilt1934" "YearBuilt1937"
## [43] "YearBuilt1942" "YearBuilt1981"
## [45] "YearBuilt1982" "YearBuilt1983"
## [47] "YearBuilt1985" "YearBuilt1987"
## [49] "YearBuilt1989" "YearBuilt2010"
## [51] "YearRemodAdd1982" "RoofStyleShed"
## [53] "RoofMatlMembran" "RoofMatlMetal"
## [55] "RoofMatlRoll" "RoofMatlWdShake"
## [57] "RoofMatlWdShngl" "Exterior1stAsphShn"
## [59] "Exterior1stBrkComm" "Exterior1stCBlock"
## [61] "Exterior1stImStucc" "Exterior1stStone"
## [63] "Exterior2ndAsphShn" "Exterior2ndCBlock"
## [65] "Exterior2ndOther" "Exterior2ndStone"
## [67] "ExterCondPo" "FoundationWood"
## [69] "BsmtCondPo" "HeatingGrav"
## [71] "HeatingOthW" "HeatingWall"
## [73] "HeatingQCPo" "ElectricalFuseP"
## [75] "ElectricalMix" "FunctionalMaj2"
## [77] "FunctionalSev" "GarageQualEx"
## [79] "GarageQualPo" "GarageCondEx"
## [81] "PoolQCEx" "PoolQCFa"
## [83] "PoolQCGd" "MiscFeatureGar2"
## [85] "MiscFeatureOthr" "MiscFeatureTenC"
## [87] "SaleTypeCon" "SaleTypeConLI"
## [89] "SaleTypeConLw" "SaleTypeOth"
DFdummies <- DFdummies[,-fewOnes] #removing predictors
dim(DFdummies)
## [1] 2919 367
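The sparse-column filter can be illustrated on a small matrix. A minimal sketch, assuming toy dummy columns (hypothetical values) where the first four rows play the role of the train set and a threshold of 2 stands in for the 10 used above. It also guards against the case where no column is sparse, because negative indexing with an empty index would drop every column:

```r
# Toy dummy columns (hypothetical); rows 1-4 act as the train set
dm <- data.frame(
  common = c(1, 1, 0, 1, 1, 0),
  rare   = c(0, 0, 0, 1, 1, 1),
  other  = c(1, 0, 1, 0, 0, 1)
)
n_train  <- 4
min_ones <- 2  # the analysis above uses 10

# Columns with too few ones among the train rows
few <- which(colSums(dm[seq_len(n_train), ]) < min_ones)

# Remove them, but keep everything if nothing was flagged
dm_kept <- if (length(few) > 0) dm[, -few, drop = FALSE] else dm
names(dm_kept)
```

Counting ones on the train rows only, as done above, avoids keeping a dummy that the model would never see during fitting.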
comb_house <- cbind(DFnorm, DFdummies) #combining all (now numeric) predictors into one dataframe
Composing train and test sets
houseFactor$SalePrice <- log((houseFactor$SalePrice)+1) #default is the natural logarithm, "+1" is not necessary as there are no 0's
train.house <- comb_house[!is.na(houseFactor$SalePrice),]
test.house <- comb_house[is.na(houseFactor$SalePrice),]
Caret Package Model - LeapBackward Method
library(leaps)
library(caret)
# Set seed for reproducibility
set.seed(123)
# Set up 10-fold cross-validation
train.control <- trainControl(method = "cv", number = 10)
# Train the model
step.modelhouse <- train(SalePrice ~., data = train.house,
                         method = "leapBackward",
                         tuneGrid = data.frame(nvmax = 1:6),
                         trControl = train.control
                         )
## Reordering variables and trying again:
#step.modelhouse$results
step.modelhouse$bestTune
## nvmax
## 4 4
step.modelhouse
## Linear Regression with Backwards Selection
##
## 2919 samples
## 396 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2626, 2627, 2627, 2627, 2627, 2627, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 1 1.000388 0.003982680 0.9987917
## 2 1.001328 0.001858748 0.9991230
## 3 1.001086 0.002615625 0.9987535
## 4 1.000104 0.003540293 0.9975237
## 5 1.000943 0.002447034 0.9982924
## 6 1.001125 0.002327971 0.9979521
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 4.
plot(step.modelhouse)
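The RMSE column in the table above is a cross-validated estimate. A minimal base R sketch of how k-fold cross-validation arrives at such a number, assuming a plain `lm()` on the built-in `mtcars` data as a stand-in (it does not reproduce the leapBackward search or the house data), with 5 folds to suit the small sample:

```r
set.seed(123)
k   <- 5
dat <- mtcars

# Randomly assign each row to one of k folds
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))

rmse <- numeric(k)
for (i in seq_len(k)) {
  train_rows <- dat[fold != i, ]  # fit on k - 1 folds
  held_out   <- dat[fold == i, ]  # evaluate on the remaining fold
  fit  <- lm(mpg ~ wt + hp, data = train_rows)
  pred <- predict(fit, newdata = held_out)
  rmse[i] <- sqrt(mean((held_out$mpg - pred)^2))
}

# Average the per-fold errors into one cross-validated RMSE
cv_rmse <- mean(rmse)
cv_rmse
```

Each candidate value of a tuning parameter such as nvmax would be scored this way, and the value with the smallest averaged RMSE would be selected.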
Homoscedasticity
Fitted vs. residual plot
# produce a residual vs fitted plot for visualizing heteroscedasticity
plotres <- resid(step.modelhouse)
plot(fitted(step.modelhouse), plotres,
     pch = 21, col="brown")
abline(0,0, lwd = 3, col = "blue")
Q-Q Plot: the residuals follow a roughly normal distribution with a heavy lower tail. Most of the points fall along the straight reference line, so the residuals are likely close to normally distributed.
#create Q-Q plot for residuals
qqnorm(plotres, col="blue")
#add a straight diagonal line to the plot
qqline(plotres, lwd = 3, col = "red")
The density plot suggests the residuals are approximately normally distributed: the curve is nearly symmetric and roughly bell-shaped.
#Create density plot of residuals
plot(density(plotres), lwd = 4, col = "purple")
Prediction
predHouse = predict(step.modelhouse, test.house)
predHouse2 <- exp(predHouse)*100000
#predHouse2
#submit <- data.frame(Id = test.labels, SalePrice = predHouse2)
#write.csv(submit, file="C:/Users/andre/OneDrive/Documents/GitHub/DATA605/Final Exam/Kaggle_Submission.csv", quote=FALSE, row.names=FALSE)
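Predictions made on a transformed response have to be mapped back to dollars by undoing each transformation in reverse order. A minimal sketch, assuming the response was transformed with log(x + 1) and then centered and scaled, as the numeric columns are in the preprocessing above; the prices are the SalePrice summary values printed earlier, and `mu`, `s`, and `z` are illustrative names:

```r
# SalePrice summary values from the output above (min, median, max)
price <- c(34900, 163000, 755000)

# Forward transformation: log(x + 1)
log_price <- log(price + 1)

# The inverse of log(x + 1) is exp(x) - 1, available as expm1()
recovered <- expm1(log_price)
all.equal(recovered, price)

# If the logged values were also standardized, undo that step first
mu <- mean(log_price)
s  <- sd(log_price)
z  <- (log_price - mu) / s
recovered_z <- expm1(z * s + mu)
all.equal(recovered_z, price)
```

The mean and standard deviation used for the back-transformation must be the ones computed when the response was standardized.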
Additional Final Reports