1. Data Exploration

a. Read Data In

The dataset of interest for this report is the real estate prices from kaggle.com.
The dataset is stored in a github repostiory and read into R programing environment directly. The first 6 rows of the data given are as below. The target variable is under “SalePrice” column that shows the dollar values of the houses last sold.

##   Id MSSubClass MSZoning LotFrontage LotArea Street
## 1  1         60       RL          65    8450   Pave
## 2  2         20       RL          80    9600   Pave
## 3  3         60       RL          68   11250   Pave
## 4  4         70       RL          60    9550   Pave
## 5  5         60       RL          84   14260   Pave
## 6  6         50       RL          85   14115   Pave

A corresponding short description of the predictors variables are given as below.

## # A tibble: 81 x 2
##    Columns_Name Short_Desc                                  
##    <fct>        <fct>                                       
##  1 Id           RowINdex                                    
##  2 MSSubClass   Building Class                              
##  3 MSZoning     Zoning Classification                       
##  4 LotFrontage  Linear feet of street connnected to property
##  5 LotArea      Lot size in square feet                     
##  6 Street       Type of road access                         
##  7 Alley        Type of alley access                        
##  8 LotShape     General shape of property                   
##  9 LandContour  Flatness of the property                    
## 10 Utilities    Type of utilities available                 
## 11 LotConfig    Lot Configuration                           
## 12 LandSlope    Slope of property                           
## 13 Neighborhood Physical locations within Ames city limits  
## # ... with 68 more rows

b. Check for missing and NA

The data is checked to see if it contains any blank or NA values.

##    na_count    col_names
## 17     1453       PoolQC
## 19     1406  MiscFeature
## 2      1369        Alley
## 18     1179        Fence
## 11      690  FireplaceQu
## 1       259  LotFrontage
## 12       81   GarageType
## 13       81  GarageYrBlt
## 14       81 GarageFinish
## 15       81   GarageQual
## 16       81   GarageCond
## 7        38 BsmtExposure
## 9        38 BsmtFinType2
## 5        37     BsmtQual
## 6        37     BsmtCond
## 8        37 BsmtFinType1
## 3         8   MasVnrType
## 4         8   MasVnrArea
## 10        1   Electrical

There are quite a few categorical variables as shown in below examples.

#unique(housetrain$BldgType)
#unique(housetrain$MiscFeature)
#unique(housetrain$PoolQC)      
#unique(housetrain$MiscFeature)     
#unique(housetrain$Alley)       
#unique(housetrain$Fence)       
#unique(housetrain$FireplaceQu)     
#unique(housetrain$GarageType)      
unique(housetrain$GarageYrBlt)      
##  [1] 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 1965 2005 1962 2006
## [15] 1960 1991 1970 1967 1958 1930 2002 1968 2007 2008 1957 1920 1966 1959
## [29] 1995 1954 1953   NA 1983 1977 1997 1985 1963 1981 1964 1999 1935 1990
## [43] 1945 1987 1989 1915 1956 1948 1974 2009 1950 1961 1921 1900 1979 1951
## [57] 1969 1936 1975 1971 1923 1984 1926 1955 1986 1988 1916 1932 1972 1918
## [71] 1980 1924 1996 1940 1949 1994 1910 1978 1982 1992 1925 1941 2010 1927
## [85] 1947 1937 1942 1938 1952 1928 1922 1934 1906 1914 1946 1908 1929 1933
unique(housetrain$GarageFinish)     
## [1] RFn  Unf  Fin  <NA>
## Levels: Fin RFn Unf
unique(housetrain$GarageQual)
## [1] TA   Fa   Gd   <NA> Ex   Po  
## Levels: Ex Fa Gd Po TA
#unique(crimetrain$LotFrontage)

c. Separate categorical data and numerical data

We will separate out the categorical data columns with missing variables and create a new data frame.

##   Id MSSubClass MSZoning LotFrontage LotArea Street
## 1  1         60       RL          65    8450   Pave
## 2  2         20       RL          80    9600   Pave
## 3  3         60       RL          68   11250   Pave
## 4  4         70       RL          60    9550   Pave
## 5  5         60       RL          84   14260   Pave
## 6  6         50       RL          85   14115   Pave

The next step is separating out columns with numerical values only. Date data such as “YearBuilt” are considered as categorical data and excluded. Date data can be transformed into numerical value by calculating the year length.

housetrain2 <- house_train[ ,!sapply(house_train, is.factor)]
housetrain2 <- housetrain2[, -which(colnames(housetrain2)%in% c('YearBuilt','YearRemodAdd','YrSold','Id'))]

housetrain2[1:6,1:6]
##   MSSubClass LotFrontage LotArea OverallQual OverallCond BsmtFinSF1
## 1         60          65    8450           7           5        706
## 2         20          80    9600           6           8        978
## 3         60          68   11250           7           5        486
## 4         70          60    9550           7           5        216
## 5         60          84   14260           8           5        655
## 6         50          85   14115           5           5        732

d. Impute missing value using MICE

Below plot shows there are missing values in “LotFrontage”; 18% of them are missing.

aggr_plot <- aggr(housetrain2, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##       Variable     Count
##    LotFrontage 0.1773973
##     MSSubClass 0.0000000
##        LotArea 0.0000000
##    OverallQual 0.0000000
##    OverallCond 0.0000000
##     BsmtFinSF1 0.0000000
##     BsmtFinSF2 0.0000000
##      BsmtUnfSF 0.0000000
##    TotalBsmtSF 0.0000000
##      X1stFlrSF 0.0000000
##      X2ndFlrSF 0.0000000
##   LowQualFinSF 0.0000000
##      GrLivArea 0.0000000
##   BsmtFullBath 0.0000000
##   BsmtHalfBath 0.0000000
##       FullBath 0.0000000
##       HalfBath 0.0000000
##   BedroomAbvGr 0.0000000
##   KitchenAbvGr 0.0000000
##   TotRmsAbvGrd 0.0000000
##     Fireplaces 0.0000000
##     GarageCars 0.0000000
##     GarageArea 0.0000000
##     WoodDeckSF 0.0000000
##    OpenPorchSF 0.0000000
##  EnclosedPorch 0.0000000
##     X3SsnPorch 0.0000000
##    ScreenPorch 0.0000000
##       PoolArea 0.0000000
##        MiscVal 0.0000000
##         MoSold 0.0000000
##      SalePrice 0.0000000

“MICE” package is used to calculate the missing value, specifically, predictive mean matching was used as an imputation method. The graph now show there is no more missing value.

housetrain3 <- mice(housetrain2, m = 5, maxit =50, method = 'pmm', seed = 500)
housetrain4 <- complete(housetrain3,1)

aggr_plot <- aggr(housetrain4, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##       Variable Count
##     MSSubClass     0
##    LotFrontage     0
##        LotArea     0
##    OverallQual     0
##    OverallCond     0
##     BsmtFinSF1     0
##     BsmtFinSF2     0
##      BsmtUnfSF     0
##    TotalBsmtSF     0
##      X1stFlrSF     0
##      X2ndFlrSF     0
##   LowQualFinSF     0
##      GrLivArea     0
##   BsmtFullBath     0
##   BsmtHalfBath     0
##       FullBath     0
##       HalfBath     0
##   BedroomAbvGr     0
##   KitchenAbvGr     0
##   TotRmsAbvGrd     0
##     Fireplaces     0
##     GarageCars     0
##     GarageArea     0
##     WoodDeckSF     0
##    OpenPorchSF     0
##  EnclosedPorch     0
##     X3SsnPorch     0
##    ScreenPorch     0
##       PoolArea     0
##        MiscVal     0
##         MoSold     0
##      SalePrice     0

e. Examine distribution of variables

Below is a mosaic of small bar charts that explore the distribution of each variables.

f. Correlation to target variable plot

Below table shows the correlation and statistical significant indication between variables in term of numbers. We see OverallQual, GarageArea, GarageCars, X1stFlrSF and TotalBsmtSF all have very high correlation with the SalePrice target variable.

##                     MoSold   SalePrice
## MSSubClass    -0.013584643 -0.08428414
## LotFrontage    0.017726148  0.33716172
## LotArea        0.001204988  0.26384335
## OverallQual    0.070815172  0.79098160
## OverallCond   -0.003510839 -0.07785589
## BsmtFinSF1    -0.015726948  0.38641981
## BsmtFinSF2    -0.015210738 -0.01137812
## BsmtUnfSF      0.034888443  0.21447911
## TotalBsmtSF    0.013196179  0.61358055
## X1stFlrSF      0.031371560  0.60585218
## X2ndFlrSF      0.035164427  0.31933380
## LowQualFinSF  -0.022173961 -0.02560613
## GrLivArea      0.050239681  0.70862448
## BsmtFullBath  -0.025360894  0.22712223
## BsmtHalfBath   0.032872705 -0.01684415
## FullBath       0.055872129  0.56066376
## HalfBath      -0.009049888  0.28410768
## BedroomAbvGr   0.046543860  0.16821315
## KitchenAbvGr   0.026588907 -0.13590737
## TotRmsAbvGrd   0.036907077  0.53372316
## Fireplaces     0.046357102  0.46692884
## GarageCars     0.040521730  0.64040920
## GarageArea     0.027973800  0.62343144
## WoodDeckSF     0.021011044  0.32441344
## OpenPorchSF    0.071254885  0.31585623
## EnclosedPorch -0.028887266 -0.12857796
## X3SsnPorch     0.029473795  0.04458367
## ScreenPorch    0.023216992  0.11144657
## PoolArea      -0.033736640  0.09240355
## MiscVal       -0.006494550 -0.02118958
## MoSold         1.000000000  0.04643225
## SalePrice      0.046432245  1.00000000

Box plot was also used to explore potential outlier within the dataset.