The dataset of interest for this report is the real estate prices from kaggle.com.
The dataset is stored in a github repostiory and read into R programing environment directly. The first 6 rows of the data given are as below. The target variable is under “SalePrice” column that shows the dollar values of the houses last sold.
## Id MSSubClass MSZoning LotFrontage LotArea Street
## 1 1 60 RL 65 8450 Pave
## 2 2 20 RL 80 9600 Pave
## 3 3 60 RL 68 11250 Pave
## 4 4 70 RL 60 9550 Pave
## 5 5 60 RL 84 14260 Pave
## 6 6 50 RL 85 14115 Pave
A corresponding short description of the predictors variables are given as below.
## # A tibble: 81 x 2
## Columns_Name Short_Desc
## <fct> <fct>
## 1 Id RowINdex
## 2 MSSubClass Building Class
## 3 MSZoning Zoning Classification
## 4 LotFrontage Linear feet of street connnected to property
## 5 LotArea Lot size in square feet
## 6 Street Type of road access
## 7 Alley Type of alley access
## 8 LotShape General shape of property
## 9 LandContour Flatness of the property
## 10 Utilities Type of utilities available
## 11 LotConfig Lot Configuration
## 12 LandSlope Slope of property
## 13 Neighborhood Physical locations within Ames city limits
## # ... with 68 more rows
The data is checked to see if it contains any blank or NA values.
## na_count col_names
## 17 1453 PoolQC
## 19 1406 MiscFeature
## 2 1369 Alley
## 18 1179 Fence
## 11 690 FireplaceQu
## 1 259 LotFrontage
## 12 81 GarageType
## 13 81 GarageYrBlt
## 14 81 GarageFinish
## 15 81 GarageQual
## 16 81 GarageCond
## 7 38 BsmtExposure
## 9 38 BsmtFinType2
## 5 37 BsmtQual
## 6 37 BsmtCond
## 8 37 BsmtFinType1
## 3 8 MasVnrType
## 4 8 MasVnrArea
## 10 1 Electrical
There are quite a few categorical variables as shown in below examples.
#unique(housetrain$BldgType)
#unique(housetrain$MiscFeature)
#unique(housetrain$PoolQC)
#unique(housetrain$MiscFeature)
#unique(housetrain$Alley)
#unique(housetrain$Fence)
#unique(housetrain$FireplaceQu)
#unique(housetrain$GarageType)
unique(housetrain$GarageYrBlt)
## [1] 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 1965 2005 1962 2006
## [15] 1960 1991 1970 1967 1958 1930 2002 1968 2007 2008 1957 1920 1966 1959
## [29] 1995 1954 1953 NA 1983 1977 1997 1985 1963 1981 1964 1999 1935 1990
## [43] 1945 1987 1989 1915 1956 1948 1974 2009 1950 1961 1921 1900 1979 1951
## [57] 1969 1936 1975 1971 1923 1984 1926 1955 1986 1988 1916 1932 1972 1918
## [71] 1980 1924 1996 1940 1949 1994 1910 1978 1982 1992 1925 1941 2010 1927
## [85] 1947 1937 1942 1938 1952 1928 1922 1934 1906 1914 1946 1908 1929 1933
unique(housetrain$GarageFinish)
## [1] RFn Unf Fin <NA>
## Levels: Fin RFn Unf
unique(housetrain$GarageQual)
## [1] TA Fa Gd <NA> Ex Po
## Levels: Ex Fa Gd Po TA
#unique(crimetrain$LotFrontage)
We will separate out the categorical data columns with missing variables and create a new data frame.
## Id MSSubClass MSZoning LotFrontage LotArea Street
## 1 1 60 RL 65 8450 Pave
## 2 2 20 RL 80 9600 Pave
## 3 3 60 RL 68 11250 Pave
## 4 4 70 RL 60 9550 Pave
## 5 5 60 RL 84 14260 Pave
## 6 6 50 RL 85 14115 Pave
The next step is separating out columns with numerical values only. Date data such as “YearBuilt” are considered as categorical data and excluded. Date data can be transformed into numerical value by calculating the year length.
housetrain2 <- house_train[ ,!sapply(house_train, is.factor)]
housetrain2 <- housetrain2[, -which(colnames(housetrain2)%in% c('YearBuilt','YearRemodAdd','YrSold','Id'))]
housetrain2[1:6,1:6]
## MSSubClass LotFrontage LotArea OverallQual OverallCond BsmtFinSF1
## 1 60 65 8450 7 5 706
## 2 20 80 9600 6 8 978
## 3 60 68 11250 7 5 486
## 4 70 60 9550 7 5 216
## 5 60 84 14260 8 5 655
## 6 50 85 14115 5 5 732
Below plot shows there are missing values in “LotFrontage”; 18% of them are missing.
aggr_plot <- aggr(housetrain2, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## LotFrontage 0.1773973
## MSSubClass 0.0000000
## LotArea 0.0000000
## OverallQual 0.0000000
## OverallCond 0.0000000
## BsmtFinSF1 0.0000000
## BsmtFinSF2 0.0000000
## BsmtUnfSF 0.0000000
## TotalBsmtSF 0.0000000
## X1stFlrSF 0.0000000
## X2ndFlrSF 0.0000000
## LowQualFinSF 0.0000000
## GrLivArea 0.0000000
## BsmtFullBath 0.0000000
## BsmtHalfBath 0.0000000
## FullBath 0.0000000
## HalfBath 0.0000000
## BedroomAbvGr 0.0000000
## KitchenAbvGr 0.0000000
## TotRmsAbvGrd 0.0000000
## Fireplaces 0.0000000
## GarageCars 0.0000000
## GarageArea 0.0000000
## WoodDeckSF 0.0000000
## OpenPorchSF 0.0000000
## EnclosedPorch 0.0000000
## X3SsnPorch 0.0000000
## ScreenPorch 0.0000000
## PoolArea 0.0000000
## MiscVal 0.0000000
## MoSold 0.0000000
## SalePrice 0.0000000
“MICE” package is used to calculate the missing value, specifically, predictive mean matching was used as an imputation method. The graph now show there is no more missing value.
housetrain3 <- mice(housetrain2, m = 5, maxit =50, method = 'pmm', seed = 500)
housetrain4 <- complete(housetrain3,1)
aggr_plot <- aggr(housetrain4, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## MSSubClass 0
## LotFrontage 0
## LotArea 0
## OverallQual 0
## OverallCond 0
## BsmtFinSF1 0
## BsmtFinSF2 0
## BsmtUnfSF 0
## TotalBsmtSF 0
## X1stFlrSF 0
## X2ndFlrSF 0
## LowQualFinSF 0
## GrLivArea 0
## BsmtFullBath 0
## BsmtHalfBath 0
## FullBath 0
## HalfBath 0
## BedroomAbvGr 0
## KitchenAbvGr 0
## TotRmsAbvGrd 0
## Fireplaces 0
## GarageCars 0
## GarageArea 0
## WoodDeckSF 0
## OpenPorchSF 0
## EnclosedPorch 0
## X3SsnPorch 0
## ScreenPorch 0
## PoolArea 0
## MiscVal 0
## MoSold 0
## SalePrice 0
Below is a mosaic of small bar charts that explore the distribution of each variables.
Below table shows the correlation and statistical significant indication between variables in term of numbers. We see OverallQual, GarageArea, GarageCars, X1stFlrSF and TotalBsmtSF all have very high correlation with the SalePrice target variable.
## MoSold SalePrice
## MSSubClass -0.013584643 -0.08428414
## LotFrontage 0.017726148 0.33716172
## LotArea 0.001204988 0.26384335
## OverallQual 0.070815172 0.79098160
## OverallCond -0.003510839 -0.07785589
## BsmtFinSF1 -0.015726948 0.38641981
## BsmtFinSF2 -0.015210738 -0.01137812
## BsmtUnfSF 0.034888443 0.21447911
## TotalBsmtSF 0.013196179 0.61358055
## X1stFlrSF 0.031371560 0.60585218
## X2ndFlrSF 0.035164427 0.31933380
## LowQualFinSF -0.022173961 -0.02560613
## GrLivArea 0.050239681 0.70862448
## BsmtFullBath -0.025360894 0.22712223
## BsmtHalfBath 0.032872705 -0.01684415
## FullBath 0.055872129 0.56066376
## HalfBath -0.009049888 0.28410768
## BedroomAbvGr 0.046543860 0.16821315
## KitchenAbvGr 0.026588907 -0.13590737
## TotRmsAbvGrd 0.036907077 0.53372316
## Fireplaces 0.046357102 0.46692884
## GarageCars 0.040521730 0.64040920
## GarageArea 0.027973800 0.62343144
## WoodDeckSF 0.021011044 0.32441344
## OpenPorchSF 0.071254885 0.31585623
## EnclosedPorch -0.028887266 -0.12857796
## X3SsnPorch 0.029473795 0.04458367
## ScreenPorch 0.023216992 0.11144657
## PoolArea -0.033736640 0.09240355
## MiscVal -0.006494550 -0.02118958
## MoSold 1.000000000 0.04643225
## SalePrice 0.046432245 1.00000000
Box plot was also used to explore potential outlier within the dataset.