## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select

Introduction

As a final project for the class Data 605 (Fundamentals of Computational Mathematics), we explore questions pertaining to probability, descriptive and inferential statistics, linear algebra and correlation, calculus-based probability and statistics, and modeling. For this exploration we use the data set from the “House Prices: Advanced Regression Techniques” competition on Kaggle.com; see the link below.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Data

To load the .csv file for this competition, we registered at www.kaggle.com and downloaded the training data (train.csv) for the “House Prices: Advanced Regression Techniques” competition. We assume that the file resides in the working directory for the remainder of the analysis.

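# Load the raw data set; stringsAsFactors = TRUE keeps text columns as factors,
# which the summary() output below relies on (the default changed in R 4.0)
my_data <- read.csv(file = "train.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)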

Let us perform some basic exploration of the data. The data set has 81 variables and 1460 observations. Based on the variable descriptions (see the data set's text documentation), the dependent variable is SalePrice. The remaining variables are a mix of qualitative and quantitative variables.

# Display the first and last few rows
head(my_data)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1  1         60       RL          65    8450   Pave  <NA>      Reg
## 2  2         20       RL          80    9600   Pave  <NA>      Reg
## 3  3         60       RL          68   11250   Pave  <NA>      IR1
## 4  4         70       RL          60    9550   Pave  <NA>      IR1
## 5  5         60       RL          84   14260   Pave  <NA>      IR1
## 6  6         50       RL          85   14115   Pave  <NA>      IR1
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 2         Lvl    AllPub       FR2       Gtl      Veenker      Feedr
## 3         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 4         Lvl    AllPub    Corner       Gtl      Crawfor       Norm
## 5         Lvl    AllPub       FR2       Gtl      NoRidge       Norm
## 6         Lvl    AllPub    Inside       Gtl      Mitchel       Norm
##   Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1       Norm     1Fam     2Story           7           5      2003
## 2       Norm     1Fam     1Story           6           8      1976
## 3       Norm     1Fam     2Story           7           5      2001
## 4       Norm     1Fam     2Story           7           5      1915
## 5       Norm     1Fam     2Story           8           5      2000
## 6       Norm     1Fam     1.5Fin           5           5      1993
##   YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1         2003     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 2         1976     Gable  CompShg     MetalSd     MetalSd       None
## 3         2002     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 4         1970     Gable  CompShg     Wd Sdng     Wd Shng       None
## 5         2000     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 6         1995     Gable  CompShg     VinylSd     VinylSd       None
##   MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1        196        Gd        TA      PConc       Gd       TA           No
## 2          0        TA        TA     CBlock       Gd       TA           Gd
## 3        162        Gd        TA      PConc       Gd       TA           Mn
## 4          0        TA        TA     BrkTil       TA       Gd           No
## 5        350        Gd        TA      PConc       Gd       TA           Av
## 6          0        TA        TA       Wood       Gd       TA           No
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1          GLQ        706          Unf          0       150         856
## 2          ALQ        978          Unf          0       284        1262
## 3          GLQ        486          Unf          0       434         920
## 4          ALQ        216          Unf          0       540         756
## 5          GLQ        655          Unf          0       490        1145
## 6          GLQ        732          Unf          0        64         796
##   Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1    GasA        Ex          Y      SBrkr       856       854            0
## 2    GasA        Ex          Y      SBrkr      1262         0            0
## 3    GasA        Ex          Y      SBrkr       920       866            0
## 4    GasA        Gd          Y      SBrkr       961       756            0
## 5    GasA        Ex          Y      SBrkr      1145      1053            0
## 6    GasA        Ex          Y      SBrkr       796       566            0
##   GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1      1710            1            0        2        1            3
## 2      1262            0            1        2        0            3
## 3      1786            1            0        2        1            3
## 4      1717            1            0        1        0            3
## 5      2198            1            0        2        1            4
## 6      1362            1            0        1        1            1
##   KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1            1          Gd            8        Typ          0        <NA>
## 2            1          TA            6        Typ          1          TA
## 3            1          Gd            6        Typ          1          TA
## 4            1          Gd            7        Typ          1          Gd
## 5            1          Gd            9        Typ          1          TA
## 6            1          TA            5        Typ          0        <NA>
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1     Attchd        2003          RFn          2        548         TA
## 2     Attchd        1976          RFn          2        460         TA
## 3     Attchd        2001          RFn          2        608         TA
## 4     Detchd        1998          Unf          3        642         TA
## 5     Attchd        2000          RFn          3        836         TA
## 6     Attchd        1993          Unf          2        480         TA
##   GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1         TA          Y          0          61             0          0
## 2         TA          Y        298           0             0          0
## 3         TA          Y          0          42             0          0
## 4         TA          Y          0          35           272          0
## 5         TA          Y        192          84             0          0
## 6         TA          Y         40          30             0        320
##   ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1           0        0   <NA>  <NA>        <NA>       0      2   2008
## 2           0        0   <NA>  <NA>        <NA>       0      5   2007
## 3           0        0   <NA>  <NA>        <NA>       0      9   2008
## 4           0        0   <NA>  <NA>        <NA>       0      2   2006
## 5           0        0   <NA>  <NA>        <NA>       0     12   2008
## 6           0        0   <NA> MnPrv        Shed     700     10   2009
##   SaleType SaleCondition SalePrice
## 1       WD        Normal    208500
## 2       WD        Normal    181500
## 3       WD        Normal    223500
## 4       WD       Abnorml    140000
## 5       WD        Normal    250000
## 6       WD        Normal    143000
tail(my_data)
##        Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1455 1455         20       FV          62    7500   Pave  Pave      Reg
## 1456 1456         60       RL          62    7917   Pave  <NA>      Reg
## 1457 1457         20       RL          85   13175   Pave  <NA>      Reg
## 1458 1458         70       RL          66    9042   Pave  <NA>      Reg
## 1459 1459         20       RL          68    9717   Pave  <NA>      Reg
## 1460 1460         20       RL          75    9937   Pave  <NA>      Reg
##      LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1455         Lvl    AllPub    Inside       Gtl      Somerst       Norm
## 1456         Lvl    AllPub    Inside       Gtl      Gilbert       Norm
## 1457         Lvl    AllPub    Inside       Gtl       NWAmes       Norm
## 1458         Lvl    AllPub    Inside       Gtl      Crawfor       Norm
## 1459         Lvl    AllPub    Inside       Gtl        NAmes       Norm
## 1460         Lvl    AllPub    Inside       Gtl      Edwards       Norm
##      Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1455       Norm     1Fam     1Story           7           5      2004
## 1456       Norm     1Fam     2Story           6           5      1999
## 1457       Norm     1Fam     1Story           6           6      1978
## 1458       Norm     1Fam     2Story           7           9      1941
## 1459       Norm     1Fam     1Story           5           6      1950
## 1460       Norm     1Fam     1Story           5           6      1965
##      YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1455         2005     Gable  CompShg     VinylSd     VinylSd       None
## 1456         2000     Gable  CompShg     VinylSd     VinylSd       None
## 1457         1988     Gable  CompShg     Plywood     Plywood      Stone
## 1458         2006     Gable  CompShg     CemntBd     CmentBd       None
## 1459         1996       Hip  CompShg     MetalSd     MetalSd       None
## 1460         1965     Gable  CompShg     HdBoard     HdBoard       None
##      MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond
## 1455          0        Gd        TA      PConc       Gd       TA
## 1456          0        TA        TA      PConc       Gd       TA
## 1457        119        TA        TA     CBlock       Gd       TA
## 1458          0        Ex        Gd      Stone       TA       Gd
## 1459          0        TA        TA     CBlock       TA       TA
## 1460          0        Gd        TA     CBlock       TA       TA
##      BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## 1455           No          GLQ        410          Unf          0
## 1456           No          Unf          0          Unf          0
## 1457           No          ALQ        790          Rec        163
## 1458           No          GLQ        275          Unf          0
## 1459           Mn          GLQ         49          Rec       1029
## 1460           No          BLQ        830          LwQ        290
##      BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1455       811        1221    GasA        Ex          Y      SBrkr
## 1456       953         953    GasA        Ex          Y      SBrkr
## 1457       589        1542    GasA        TA          Y      SBrkr
## 1458       877        1152    GasA        Ex          Y      SBrkr
## 1459         0        1078    GasA        Gd          Y      FuseA
## 1460       136        1256    GasA        Gd          Y      SBrkr
##      X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## 1455      1221         0            0      1221            1            0
## 1456       953       694            0      1647            0            0
## 1457      2073         0            0      2073            1            0
## 1458      1188      1152            0      2340            0            0
## 1459      1078         0            0      1078            1            0
## 1460      1256         0            0      1256            1            0
##      FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 1455        2        0            2            1          Gd            6
## 1456        2        1            3            1          TA            7
## 1457        2        0            3            1          TA            7
## 1458        2        0            4            1          Gd            9
## 1459        1        0            2            1          Gd            5
## 1460        1        1            3            1          TA            6
##      Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
## 1455        Typ          0        <NA>     Attchd        2004          RFn
## 1456        Typ          1          TA     Attchd        1999          RFn
## 1457       Min1          2          TA     Attchd        1978          Unf
## 1458        Typ          2          Gd     Attchd        1941          RFn
## 1459        Typ          0        <NA>     Attchd        1950          Unf
## 1460        Typ          0        <NA>     Attchd        1965          Fin
##      GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF
## 1455          2        400         TA         TA          Y          0
## 1456          2        460         TA         TA          Y          0
## 1457          2        500         TA         TA          Y        349
## 1458          1        252         TA         TA          Y          0
## 1459          1        240         TA         TA          Y        366
## 1460          1        276         TA         TA          Y        736
##      OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
## 1455         113             0          0           0        0   <NA>
## 1456          40             0          0           0        0   <NA>
## 1457           0             0          0           0        0   <NA>
## 1458          60             0          0           0        0   <NA>
## 1459           0           112          0           0        0   <NA>
## 1460          68             0          0           0        0   <NA>
##      Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
## 1455  <NA>        <NA>       0     10   2009       WD        Normal
## 1456  <NA>        <NA>       0      8   2007       WD        Normal
## 1457 MnPrv        <NA>       0      2   2010       WD        Normal
## 1458 GdPrv        Shed    2500      5   2010       WD        Normal
## 1459  <NA>        <NA>       0      4   2010       WD        Normal
## 1460  <NA>        <NA>       0      6   2008       WD        Normal
##      SalePrice
## 1455    185000
## 1456    175000
## 1457    210000
## 1458    266500
## 1459    142125
## 1460    147500

From the display, we observe that quite a few independent variables have missing observations, indicated by NA. These will have to be accounted for if any of these variables is used in the analysis. Let us now run the summary function on the data to obtain basic statistics.

# Summary function on Data Set
summary(my_data)
##        Id           MSSubClass       MSZoning     LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   C (all):  10   Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   FV     :  65   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   RH     :  16   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9   RL     :1151   Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   RM     : 218   3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                  Max.   :313.00  
##                                                  NA's   :259     
##     LotArea        Street      Alley      LotShape  LandContour
##  Min.   :  1300   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63   
##  1st Qu.:  7554   Pave:1454   Pave:  41   IR2: 41   HLS:  50   
##  Median :  9478               NA's:1369   IR3: 10   Low:  36   
##  Mean   : 10517                           Reg:925   Lvl:1311   
##  3rd Qu.: 11602                                                
##  Max.   :215245                                                
##                                                                
##   Utilities      LotConfig    LandSlope   Neighborhood   Condition1  
##  AllPub:1459   Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260  
##  NoSeWa:   1   CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81  
##                FR2    :  47   Sev:  13   OldTown:113   Artery :  48  
##                FR3    :   4              Edwards:100   RRAn   :  26  
##                Inside :1052              Somerst: 86   PosN   :  19  
##                                          Gilbert: 79   RRAe   :  11  
##                                          (Other):707   (Other):  15  
##    Condition2     BldgType      HouseStyle   OverallQual    
##  Norm   :1445   1Fam  :1220   1Story :726   Min.   : 1.000  
##  Feedr  :   6   2fmCon:  31   2Story :445   1st Qu.: 5.000  
##  Artery :   2   Duplex:  52   1.5Fin :154   Median : 6.000  
##  PosN   :   2   Twnhs :  43   SLvl   : 65   Mean   : 6.099  
##  RRNn   :   2   TwnhsE: 114   SFoyer : 37   3rd Qu.: 7.000  
##  PosA   :   1                 1.5Unf : 14   Max.   :10.000  
##  (Other):   2                 (Other): 19                   
##   OverallCond      YearBuilt     YearRemodAdd    RoofStyle   
##  Min.   :1.000   Min.   :1872   Min.   :1950   Flat   :  13  
##  1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967   Gable  :1141  
##  Median :5.000   Median :1973   Median :1994   Gambrel:  11  
##  Mean   :5.575   Mean   :1971   Mean   :1985   Hip    : 286  
##  3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004   Mansard:   7  
##  Max.   :9.000   Max.   :2010   Max.   :2010   Shed   :   2  
##                                                              
##     RoofMatl     Exterior1st   Exterior2nd    MasVnrType    MasVnrArea    
##  CompShg:1434   VinylSd:515   VinylSd:504   BrkCmn : 15   Min.   :   0.0  
##  Tar&Grv:  11   HdBoard:222   MetalSd:214   BrkFace:445   1st Qu.:   0.0  
##  WdShngl:   6   MetalSd:220   HdBoard:207   None   :864   Median :   0.0  
##  WdShake:   5   Wd Sdng:206   Wd Sdng:197   Stone  :128   Mean   : 103.7  
##  ClyTile:   1   Plywood:108   Plywood:142   NA's   :  8   3rd Qu.: 166.0  
##  Membran:   1   CemntBd: 61   CmentBd: 60                 Max.   :1600.0  
##  (Other):   2   (Other):128   (Other):136                 NA's   :8       
##  ExterQual ExterCond  Foundation  BsmtQual   BsmtCond    BsmtExposure
##  Ex: 52    Ex:   3   BrkTil:146   Ex  :121   Fa  :  45   Av  :221    
##  Fa: 14    Fa:  28   CBlock:634   Fa  : 35   Gd  :  65   Gd  :134    
##  Gd:488    Gd: 146   PConc :647   Gd  :618   Po  :   2   Mn  :114    
##  TA:906    Po:   1   Slab  : 24   TA  :649   TA  :1311   No  :953    
##            TA:1282   Stone :  6   NA's: 37   NA's:  37   NA's: 38    
##                      Wood  :  3                                      
##                                                                      
##  BsmtFinType1   BsmtFinSF1     BsmtFinType2   BsmtFinSF2     
##  ALQ :220     Min.   :   0.0   ALQ :  19    Min.   :   0.00  
##  BLQ :148     1st Qu.:   0.0   BLQ :  33    1st Qu.:   0.00  
##  GLQ :418     Median : 383.5   GLQ :  14    Median :   0.00  
##  LwQ : 74     Mean   : 443.6   LwQ :  46    Mean   :  46.55  
##  Rec :133     3rd Qu.: 712.2   Rec :  54    3rd Qu.:   0.00  
##  Unf :430     Max.   :5644.0   Unf :1256    Max.   :1474.00  
##  NA's: 37                      NA's:  38                     
##    BsmtUnfSF       TotalBsmtSF      Heating     HeatingQC CentralAir
##  Min.   :   0.0   Min.   :   0.0   Floor:   1   Ex:741    N:  95    
##  1st Qu.: 223.0   1st Qu.: 795.8   GasA :1428   Fa: 49    Y:1365    
##  Median : 477.5   Median : 991.5   GasW :  18   Gd:241              
##  Mean   : 567.2   Mean   :1057.4   Grav :   7   Po:  1              
##  3rd Qu.: 808.0   3rd Qu.:1298.2   OthW :   2   TA:428              
##  Max.   :2336.0   Max.   :6110.0   Wall :   4                       
##                                                                     
##  Electrical     X1stFlrSF      X2ndFlrSF     LowQualFinSF    
##  FuseA:  94   Min.   : 334   Min.   :   0   Min.   :  0.000  
##  FuseF:  27   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
##  FuseP:   3   Median :1087   Median :   0   Median :  0.000  
##  Mix  :   1   Mean   :1163   Mean   : 347   Mean   :  5.845  
##  SBrkr:1334   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
##  NA's :   1   Max.   :4692   Max.   :2065   Max.   :572.000  
##                                                              
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
##                                                                   
##     HalfBath       BedroomAbvGr    KitchenAbvGr   KitchenQual
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Ex:100     
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Fa: 39     
##  Median :0.0000   Median :3.000   Median :1.000   Gd:586     
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   TA:735     
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000              
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000              
##                                                              
##   TotRmsAbvGrd    Functional    Fireplaces    FireplaceQu   GarageType 
##  Min.   : 2.000   Maj1:  14   Min.   :0.000   Ex  : 24    2Types :  6  
##  1st Qu.: 5.000   Maj2:   5   1st Qu.:0.000   Fa  : 33    Attchd :870  
##  Median : 6.000   Min1:  31   Median :1.000   Gd  :380    Basment: 19  
##  Mean   : 6.518   Min2:  34   Mean   :0.613   Po  : 20    BuiltIn: 88  
##  3rd Qu.: 7.000   Mod :  15   3rd Qu.:1.000   TA  :313    CarPort:  9  
##  Max.   :14.000   Sev :   1   Max.   :3.000   NA's:690    Detchd :387  
##                   Typ :1360                               NA's   : 81  
##   GarageYrBlt   GarageFinish   GarageCars      GarageArea     GarageQual 
##  Min.   :1900   Fin :352     Min.   :0.000   Min.   :   0.0   Ex  :   3  
##  1st Qu.:1961   RFn :422     1st Qu.:1.000   1st Qu.: 334.5   Fa  :  48  
##  Median :1980   Unf :605     Median :2.000   Median : 480.0   Gd  :  14  
##  Mean   :1979   NA's: 81     Mean   :1.767   Mean   : 473.0   Po  :   3  
##  3rd Qu.:2002                3rd Qu.:2.000   3rd Qu.: 576.0   TA  :1311  
##  Max.   :2010                Max.   :4.000   Max.   :1418.0   NA's:  81  
##  NA's   :81                                                              
##  GarageCond  PavedDrive   WoodDeckSF      OpenPorchSF     EnclosedPorch   
##  Ex  :   2   N:  90     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  Fa  :  35   P:  30     1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Gd  :   9   Y:1340     Median :  0.00   Median : 25.00   Median :  0.00  
##  Po  :   7              Mean   : 94.24   Mean   : 46.66   Mean   : 21.95  
##  TA  :1326              3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00  
##  NA's:  81              Max.   :857.00   Max.   :547.00   Max.   :552.00  
##                                                                           
##    X3SsnPorch      ScreenPorch        PoolArea        PoolQC    
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Ex  :   2  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000   Fa  :   2  
##  Median :  0.00   Median :  0.00   Median :  0.000   Gd  :   3  
##  Mean   :  3.41   Mean   : 15.06   Mean   :  2.759   NA's:1453  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000              
##  Max.   :508.00   Max.   :480.00   Max.   :738.000              
##                                                                 
##    Fence      MiscFeature    MiscVal             MoSold      
##  GdPrv:  59   Gar2:   2   Min.   :    0.00   Min.   : 1.000  
##  GdWo :  54   Othr:   2   1st Qu.:    0.00   1st Qu.: 5.000  
##  MnPrv: 157   Shed:  49   Median :    0.00   Median : 6.000  
##  MnWw :  11   TenC:   1   Mean   :   43.49   Mean   : 6.322  
##  NA's :1179   NA's:1406   3rd Qu.:    0.00   3rd Qu.: 8.000  
##                           Max.   :15500.00   Max.   :12.000  
##                                                              
##      YrSold        SaleType    SaleCondition    SalePrice     
##  Min.   :2006   WD     :1267   Abnorml: 101   Min.   : 34900  
##  1st Qu.:2007   New    : 122   AdjLand:   4   1st Qu.:129975  
##  Median :2008   COD    :  43   Alloca :  12   Median :163000  
##  Mean   :2008   ConLD  :   9   Family :  20   Mean   :180921  
##  3rd Qu.:2009   ConLI  :   5   Normal :1198   3rd Qu.:214000  
##  Max.   :2010   ConLw  :   5   Partial: 125   Max.   :755000  
##                 (Other):   9
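
The NA counts that appear in the summary output can also be tallied directly, which makes it easier to see which variables are free of missing values. The following is a minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration and are not read from train.csv); the same two lines would apply to `my_data`.

```r
# Hypothetical toy data frame standing in for my_data (not the Kaggle file)
toy <- data.frame(
  LotFrontage = c(65, NA, 68, NA),
  Alley       = c(NA, NA, "Pave", NA),
  LotArea     = c(8450, 9600, 11250, 9550)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Keep only the columns with at least one NA, largest count first
na_counts <- sort(na_counts[na_counts > 0], decreasing = TRUE)
print(na_counts)
```

Columns that do not appear in `na_counts` have no missing values and are candidates for the X variable.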

For the remainder of the analysis, we need to select one of the independent quantitative variables (one requirement is that the distribution of this variable is skewed to the right). We will plot histograms of several independent quantitative variables to determine the shape of their distributions (please refer to Appendix A for the variable types). We first consider quantitative variables with no missing values (no NA’s).

Let us consider the following variables: LotArea, BsmtFinSF1, BsmtUnfSF, TotalBsmtSF, X1stFlrSF, GrLivArea, GarageArea

hist(my_data$LotArea)

hist(my_data$BsmtFinSF1)

hist(my_data$BsmtUnfSF)

hist(my_data$TotalBsmtSF)

hist(my_data$X1stFlrSF)

hist(my_data$GrLivArea)

hist(my_data$GarageArea)

We will use TotalBsmtSF as our independent quantitative variable X. From the histogram, the distribution of this variable appears skewed to the right. We check this by calculating the mean and median and comparing them.

my_data %>% summarise(mean_g = mean(TotalBsmtSF), median_g = median(TotalBsmtSF))
##     mean_g median_g
## 1 1057.429    991.5
summary(my_data$TotalBsmtSF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   795.8   991.5  1057.0  1298.0  6110.0

Since mean > median for this variable, the distribution is consistent with a right skew. Also, the summary statistics show no missing values for ‘TotalBsmtSF’ (total basement square feet), which makes it a good choice for the X variable.
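
Comparing the mean and the median is only a quick heuristic. A moment-based skewness coefficient gives a second check. The sketch below uses a hypothetical right-skewed vector `v` (not taken from train.csv) and the population-style definition, third central moment divided by the second central moment raised to the power 3/2; other definitions of sample skewness differ by a small-sample correction.

```r
# Hypothetical right-skewed sample (not taken from train.csv)
v <- c(1, 2, 2, 3, 3, 3, 4, 20)

# Quick check used in the text: a mean above the median hints at right skew
mean_v   <- mean(v)    # 4.75
median_v <- median(v)  # 3

# Moment-based skewness using central moments that divide by n
m2 <- mean((v - mean_v)^2)
m3 <- mean((v - mean_v)^3)
skew_v <- m3 / m2^(3 / 2)

# A positive coefficient indicates a longer right tail
skew_v > 0
```

Replacing `v` with `my_data$TotalBsmtSF` would apply the same check to the chosen variable.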

Probabilities

Let X denote the independent variable TotalBsmtSF (total square feet of basement area) and Y the dependent variable SalePrice (the property’s sale price in dollars). Let x be the 3rd quartile of X and y the 2nd quartile (median) of Y.

We will now calculate the following:

  1. P(X>x | Y>y): the probability that X is above its 3rd quartile given that Y is above its 2nd quartile, that is, the probability that the total basement square footage is above the 3rd quartile given that the sale price is above the 2nd quartile.

By definition of the quartiles, P(X<= x) = .75 and P(Y<= y) = .50, which implies that P(X>x) = .25 and P(Y>y) = .50. The conditional probability is \(P(X>x | Y>y) = \frac{P(X>x \& Y>y)}{P(Y>y)}\).

Since we know P(Y>y), it remains to determine P(X>x & Y>y). It is highly unlikely that the two events X>x and Y>y are independent, since the initial premise of the data set is that there may be some correlation between SalePrice (the Y variable) and the other variables.

To determine P(X>x & Y>y) we use the data we have. Probability calculations are rounded to 2 decimal places.

# find x and y
x <- quantile(my_data$TotalBsmtSF)[4]
y <- quantile(my_data$SalePrice)[3]

# Counts with respect to x and y
below_x <- my_data %>% filter(TotalBsmtSF <= x) %>% summarise(n())
below_strictly_x <- my_data %>% filter(TotalBsmtSF < x) %>% summarise(n())
above_x <- my_data %>% filter(TotalBsmtSF > x) %>% summarise(n())

below_y <- my_data %>% filter(SalePrice <= y) %>% summarise(n())
above_y <- my_data %>% filter(SalePrice > y) %>% summarise(n())


# First row of Grid
below_x_below_y <- my_data %>% filter(TotalBsmtSF <= x , SalePrice <= y) %>% summarise(n())
below_x_above_y <- my_data %>% filter(TotalBsmtSF <= x , SalePrice > y) %>% summarise(n())
below_strictly_x_above_y <- my_data %>% filter(TotalBsmtSF < x , SalePrice > y) %>% summarise(n())

# Second row of Grid
above_x_below_y <- my_data %>% filter(TotalBsmtSF > x , SalePrice <= y) %>% summarise(n())
above_x_above_y <- my_data %>% filter(TotalBsmtSF > x , SalePrice > y) %>% summarise(n())


total <- nrow(my_data)


# Sanity Check P(X <= x) and P(X > x) calculated
p_below_x <- round(below_x / total,2)
p_below_strictly_x <- round(below_strictly_x/total, 2)
p_above_x <- round(above_x / total,2)

# Calculate P(Y>y)
p_above_y <- round(above_y/total,2)

p_above_x_above_y <- round(above_x_above_y / total, 2)
p_below_strictly_x_above_y <- round(below_strictly_x_above_y / total, 2)
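
The same probabilities can be obtained more compactly by taking the mean of logical vectors, since the mean of TRUE/FALSE values is the proportion of TRUE. The sketch below uses hypothetical toy vectors `X` and `Y` (made-up values, not the Kaggle data) in place of `my_data$TotalBsmtSF` and `my_data$SalePrice`.

```r
# Hypothetical toy values standing in for TotalBsmtSF (X) and SalePrice (Y)
X <- c(800, 950, 1000, 1100, 1300, 1500, 1700, 2000)
Y <- c(120, 150, 140, 170, 160, 220, 250, 300) * 1000

x <- quantile(X, 0.75)  # 3rd quartile of X
y <- quantile(Y, 0.50)  # 2nd quartile (median) of Y

A <- X > x  # logical vector: TRUE where X exceeds its 3rd quartile
B <- Y > y  # logical vector: TRUE where Y exceeds its median

p_B         <- mean(B)        # P(Y > y)
p_A_and_B   <- mean(A & B)    # P(X > x & Y > y)
p_A_given_B <- p_A_and_B / p_B  # P(X > x | Y > y)
```

Rounding only the final ratio, rather than each intermediate probability, avoids compounding rounding error.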

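Based on our calculation, P(X>x & Y>y) = 0.23. Substituting back into the formula with P(Y>y) = 0.50, we obtain P(X>x | Y>y) = 0.23 / 0.50 = 0.46. (Dividing the unrounded counts instead, 329 / 728, gives about 0.45, which is the value used further below.)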

  2. P(X>x, Y>y): the probability that the total basement square footage of a house is above the 3rd quartile and its sale price is above the 2nd quartile. This was already calculated above: 0.23.

  3. P(X<x | Y>y): the probability that the total basement square footage of a house is strictly below the 3rd quartile given that its sale price is above the 2nd quartile. In our data set, X<x has the same count as X<= x.

\(P(X<x | Y>y) = \frac{P(X<x \& Y>y)}{P(Y>y)}\)
\(P(X<x | Y>y) =\) 0.54 (0.27 / 0.50 with the rounded probabilities; the unrounded ratio 399 / 728 is about 0.55).

| x/y | below 2nd quartile | above 2nd quartile | Total |
|---|---|---|---|
| below 3rd quartile | 696 | 399 | 1095 |
| above 3rd quartile | 36 | 329 | 365 |
| Total | 732 | 728 | 1460 |

Let A be the event that an observation is above the 3rd quartile of X, and let B be the event that an observation is above the 2nd quartile of Y.

Hence A = {X>x} and B = {Y>y}. If A and B are independent, then knowing B should not change the probability of A, so P(A|B)=P(A).

This can be derived from the conditional probability formula:
\(P(A|B)=\frac { P(A\cap B) }{ P(B) }\). When A and B are independent we have \(P(A\cap B)=P(A)\cdot P(B)\), which leads to the following:

If A and B are independent, we have \(P(A|B)=\frac { P(A)\cdot P(B) }{ P(B) } \quad \Leftrightarrow \quad P(A|B)=P(A)\)

Let us verify by calculation:

P(A|B) = 0.45
P(A) = 0.25
Since these two values are not the same, the events A and B are not independent.
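
The comparison of P(A|B) with P(A) can be reproduced from the four cell counts of the contingency table alone. This is a small sketch in which the matrix `counts` and its dimension labels are introduced here for illustration; the numbers are copied from the table in the text.

```r
# Cell counts copied from the contingency table in the text
# (rows: X relative to its 3rd quartile, columns: Y relative to its median)
counts <- matrix(c(696, 36, 399, 329), nrow = 2,
                 dimnames = list(X = c("<= x", "> x"),
                                 Y = c("<= y", "> y")))

n <- sum(counts)                          # 1460 observations

p_A <- sum(counts["> x", ]) / n           # P(A) = 365 / 1460
p_B <- sum(counts[, "> y"]) / n           # P(B) = 728 / 1460
p_A_and_B <- counts["> x", "> y"] / n     # P(A and B) = 329 / 1460

p_A_given_B <- p_A_and_B / p_B            # P(A | B) = 329 / 728

round(c(P_A = p_A, P_A_given_B = p_A_given_B), 2)
```

Because the ratio is formed before rounding, this gives 0.45 for P(A|B), compared with 0.25 for P(A).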

# Build the contingency table for events A and B and run the chi-square test

m_tbl <- table(my_data$TotalBsmtSF > x, my_data$SalePrice>y)
m_tst <- chisq.test(m_tbl)

Let H0: A and B are independent
Ha: A and B are not independent

The Chi-Square test yields an extremely small p-value (< 0.05), hence we reject H0 and conclude that A and B are not independent.

Results of Chi-Square test:

m_tst
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  m_tbl
## X-squared = 313.61, df = 1, p-value < 2.2e-16
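
To see where the test statistic comes from, the sketch below rebuilds the 2x2 table from the counts given earlier and recomputes the statistic by hand. The names `counts`, `tst`, and `manual` are introduced here for illustration. Under independence, each expected count is (row total x column total) / n, and for a 2x2 table `chisq.test` applies Yates' continuity correction by default, subtracting 0.5 from each absolute difference.

```r
# Cell counts from the contingency table (filled column by column)
counts <- matrix(c(696, 36, 399, 329), nrow = 2)

tst <- chisq.test(counts)  # Yates' correction is the default for 2x2 tables

# Expected counts under independence: row total * column total / n
tst$expected

# Statistic recomputed by hand with the continuity correction
manual <- sum((abs(counts - tst$expected) - 0.5)^2 / tst$expected)
manual
```

The expected counts (549, 546, 183, 182) are far from the observed ones, which is what drives the large statistic.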

Descriptive and Inferential Statistics

Still considering the variables X and Y selected above, we now provide univariate statistics. Let us run some basic statistics on each variable.

mean_X <-round(mean(my_data$TotalBsmtSF), 2)
sd_X <- round(sd(my_data$TotalBsmtSF),4)

median_X <-round(median(my_data$TotalBsmtSF), 2)

mean_Y <- round(mean(my_data$SalePrice),2)
sd_Y <- round(sd(my_data$SalePrice),4)

median_Y <- round(median(my_data$SalePrice), 2)

t_obs <- nrow(my_data)

The mean of X is 1057.43, the standard deviation is 438.7053, and the median is 991.5.
The mean of Y is 180921.2, the standard deviation is 79442.503, and the median is 163000.

# Box Plots
ggplot(my_data, aes(x=1, y=TotalBsmtSF)) + geom_boxplot() + scale_x_continuous(breaks = NULL) + theme(axis.title.x = element_blank())

ggplot(my_data, aes(x=1, y=SalePrice)) + geom_boxplot() + scale_x_continuous(breaks = NULL) + scale_y_continuous(labels = comma) + theme(axis.title.x = element_blank())

# Histograms
ggplot(my_data, aes(x=TotalBsmtSF)) + geom_histogram(binwidth = 20)

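# binwidth is in dollars; a width of 10 would spread the prices over tens of thousands of nearly empty bins
ggplot(my_data, aes(x=SalePrice)) + geom_histogram(binwidth = 10000) + scale_x_continuous(labels = comma)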

From the box plots and histograms, there appear to be outliers at high values.

Let us now look at the scatter plot for X and Y. Because the data points are concentrated in the lower part of the graph, we will use transparency.

sp <- ggplot(my_data, aes(x=TotalBsmtSF, y=SalePrice)) 
sp + geom_point(alpha = 0.2, colour = 'blue')

From the scatter plot, there is a strong positive relationship between the Total Basement Square Footage and the Sale Price of the house.

result <- t.test(my_data$TotalBsmtSF, my_data$SalePrice, alternative = 'two.sided', paired = TRUE, conf.level = 0.95)

From the result of the calculation, we have a 95% confidence interval of [\(-1.8392834\times 10^{5}\), \(-1.7579919\times 10^{5}\)] for the mean of the paired differences between X and Y. In other words, we are 95% confident that the true mean difference X - Y lies in this interval.
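To make explicit what the paired interval measures, the sketch below shows that a paired t-test is the same as a one-sample t-test on the per-observation differences. The vectors `u` and `v` are hypothetical and used for illustration only.

```r
# Hypothetical paired measurements (assumed for illustration only)
set.seed(1)
u <- rnorm(50, mean = 10, sd = 2)
v <- u + rnorm(50, mean = 3, sd = 1)

# Confidence interval from the paired test
paired_ci <- t.test(u, v, paired = TRUE, conf.level = 0.95)$conf.int

# Confidence interval from a one-sample test on the differences
diff_ci <- t.test(u - v, conf.level = 0.95)$conf.int

# Both intervals describe the mean of the differences u - v
all.equal(as.numeric(paired_ci), as.numeric(diff_ci))
```

Because the interval is about the differences themselves, it is expressed in the units of X - Y.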

We will now find the correlation matrix for X and Y and test the correlation between these 2 variables with a 99% confidence interval.

H0: correlation between X and Y is 0
Ha: correlation between X and Y is not 0

# MASS masks dplyr's select function, so we call dplyr::select explicitly
df <- dplyr::select(my_data,TotalBsmtSF, SalePrice)
m_A <- cor(df)

c_A <- cor.test(my_data$TotalBsmtSF, my_data$SalePrice, conf.level = 0.99)
c_A
## 
##  Pearson's product-moment correlation
## 
## data:  my_data$TotalBsmtSF and my_data$SalePrice
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.5697562 0.6539251
## sample estimates:
##       cor 
## 0.6135806

The p-value for the correlation test is very small (\(9.4842294\times 10^{-152}\)), far below both 0.05 and the 0.01 level implied by the 99% confidence interval, therefore we reject H0 and conclude that the correlation is not zero. As a rule of thumb, if \(|r| \ge \frac{2}{\sqrt{n}}\), where r is the correlation and n is the sample size, then a relationship exists. Since our sample size is so large, we clearly have:
\(|r| \ge \frac{2}{\sqrt{n}}\), since 0.6135806 \(\ge\) 0.0523424
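The arithmetic behind the rule of thumb, and its link to the t statistic reported by cor.test, can be checked with a few lines. This is a small sketch that simply plugs in the values of n and r reported above; the variable names are our own.

```r
n <- 1460        # number of observations in train.csv
r <- 0.6135806   # sample correlation reported by cor.test

# Rule-of-thumb threshold 2 / sqrt(n)
threshold <- 2 / sqrt(n)
round(threshold, 7)
abs(r) >= threshold

# t statistic of the correlation test: t = r * sqrt((n - 2) / (1 - r^2))
t_stat <- r * sqrt((n - 2) / (1 - r^2))
round(t_stat, 3)
```

The t statistic obtained this way should agree with the value of 29.671 shown in the cor.test output.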

We can conclude that we have a positive relationship between X (Total Basement Square feet) and Y (Sale Price).

Linear Algebra and Correlation

We will now conduct PCA on the correlation matrix. Since the 2 variables we have selected for the analysis are measured in different units, it is preferable to use the correlation matrix as the basis for PCA (correlation being a standardized measure).

Let us find the inverse of the correlation matrix, known as the precision matrix.

m_A_inv <- solve(m_A)

M1 <- m_A %*% m_A_inv
M2 <- m_A_inv %*% m_A

Correlation matrix: \(\begin{pmatrix} 1 & 0.6135806 \\ 0.6135806 & 1 \end{pmatrix}\); its inverse, the precision matrix: \(\begin{pmatrix} 1.6038006 & -0.9840609 \\ -0.9840609 & 1.6038006 \end{pmatrix}\).

Since the 2 matrices are inverses of each other, we expect the product of the correlation matrix with the precision matrix to be the identity matrix.
Indeed, we have Correlation_Matrix x Precision_Matrix = \(\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\) and Precision_Matrix x Correlation_Matrix = \(\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\).
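For a 2 x 2 correlation matrix with off-diagonal entry r, the inverse has a simple closed form: the matrix with 1 on the diagonal and -r off the diagonal, divided by 1 - r^2. The sketch below compares that closed form with solve(), using the rounded correlation reported above; the object names are illustrative.

```r
r <- 0.6135806  # correlation between TotalBsmtSF and SalePrice reported above

m_cor <- matrix(c(1, r, r, 1), nrow = 2)

# Closed-form inverse of a 2 x 2 correlation matrix
m_prec_closed <- matrix(c(1, -r, -r, 1), nrow = 2) / (1 - r^2)

# Numerical inverse
m_prec_solve <- solve(m_cor)

all.equal(m_prec_closed, m_prec_solve)

# Product with the correlation matrix, rounded to remove floating-point noise
round(m_cor %*% m_prec_solve, 10)
```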

PCA, or Principal Component Analysis, is a data reduction technique. As we understand it, the data points are projected onto a vector chosen so that the projection preserves maximum variability. Once this first vector is found, the next one is chosen to be orthogonal to the first while maximizing the remaining variability, and so on. This process reduces the observed variables to a smaller number of principal components which account for most of the variance of the observed variables.

From the correlation matrix, we will calculate the eigenvectors and eigenvalues. The largest eigenvalue indicates the most variability and corresponds to the eigenvector that represents the first component.

# Eigen vectors and Eigen values of correlation matrix
m_A_eigen <- eigen(m_A)
#m_A_inv_eigen <- eigen(m_A_inv)

m_A_eigen
## $values
## [1] 1.6135806 0.3864194
## 
## $vectors
##           [,1]       [,2]
## [1,] 0.7071068 -0.7071068
## [2,] 0.7071068  0.7071068
#m_A_inv_eigen

From the eigenvalues, 1.6135806 and 0.3864194, we can conclude that the first principal component is given by the eigenvector (0.7071068, 0.7071068) and the second by (-0.7071068, 0.7071068).
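These values follow a general pattern for any 2 x 2 correlation matrix: the eigenvalues are 1 + r and 1 - r, and the eigenvectors point along (1, 1) and (-1, 1), scaled by \(1/\sqrt{2}\). The short sketch below checks this with the rounded correlation from above; the first component then carries (1 + r)/2, roughly 80.7%, of the total variance.

```r
r <- 0.6135806
m_cor <- matrix(c(1, r, r, 1), nrow = 2)

e <- eigen(m_cor)

# Eigenvalues compared with the closed form 1 + r and 1 - r
all.equal(e$values, c(1 + r, 1 - r))

# Share of total variance carried by the first component
share_pc1 <- e$values[1] / sum(e$values)
share_pc1
```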

PCA on more variables from Data Set

We will conduct PCA on quantitative variables from the training set as follows:

LotFrontage, we will impute missing data with 0,
LotArea,
YearBuilt,
YearRemodAdd,
MasVnrArea, we will impute missing data with 0,
BsmtFinSF1,
BsmtFinSF2,
BsmtUnfSF,
TotalBsmtSF,
X1stFlrSF,
X2ndFlrSF,
LowQualFinSF,
GrLivArea,
BsmtFullBath,
BsmtHalfBath,
FullBath,
HalfBath,
BedroomAbvGr,
KitchenAbvGr,
TotRmsAbvGrd,
Fireplaces,
GarageYrBlt, impute missing data with house build year
GarageCars,
GarageArea

We will first build this new data set and then impute the missing data for these variables as indicated.

df1 <- dplyr::select(my_data, LotFrontage, LotArea, YearBuilt, YearRemodAdd, MasVnrArea, BsmtFinSF1, BsmtFinSF2,BsmtUnfSF, TotalBsmtSF, X1stFlrSF, X2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, TotRmsAbvGrd, Fireplaces, GarageYrBlt, GarageCars, GarageArea)

# impute missing data
df1$LotFrontage[is.na(df1$LotFrontage)]<-0
df1$MasVnrArea[is.na(df1$MasVnrArea)]<-0
df1 <- df1 %>% mutate(GarageYrBlt = ifelse(is.na(GarageYrBlt), YearBuilt, GarageYrBlt))

We will perform PCA on this data set.

prin_comp <- prcomp(df1, center = TRUE, scale. = TRUE)
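With centering and scaling turned on, prcomp should give the same result as an eigen decomposition of the correlation matrix, which ties this step to the 2-variable example above. The sketch below checks this on the built-in mtcars data, used here only as a hypothetical stand-in for df1.

```r
# Built-in data set used as a stand-in for df1 (assumption for illustration)
toy <- mtcars[, c("mpg", "disp", "hp", "wt")]

pc <- prcomp(toy, center = TRUE, scale. = TRUE)
ev <- eigen(cor(toy))

# Variances of the principal components equal the eigenvalues of the correlation matrix
all.equal(pc$sdev^2, ev$values)

# Loadings agree with the eigenvectors up to the sign of each column
all.equal(abs(unname(pc$rotation)), abs(ev$vectors))
```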

Let us now examine some key results of PCA. First we will plot the variables on a graph with PC1 and PC2 as axes.

biplot(prin_comp, scale = 0)

Unfortunately, the biplot is difficult to interpret for variability along PC1. From the plot, we would expect X2ndFlrSF, BedroomAbvGr, TotRmsAbvGrd, GrLivArea, BsmtFinSF1 and (possibly) BsmtFullBath to stand out. To get confirmation, we will look at the rotation matrix.

prin_comp$rotation
##                       PC1          PC2         PC3          PC4
## LotFrontage  -0.107942712  0.031879350  0.11578258 -0.201992974
## LotArea      -0.116287525 -0.023751975  0.29321734  0.001608433
## YearBuilt    -0.250777046 -0.214371658 -0.33094074  0.083738271
## YearRemodAdd -0.222274311 -0.107477888 -0.29501259  0.049595993
## MasVnrArea   -0.213317465 -0.028376777  0.02565112  0.048500894
## BsmtFinSF1   -0.149087754 -0.311156932  0.31586683  0.269215581
## BsmtFinSF2    0.015173227 -0.070534416  0.18889936  0.056799637
## BsmtUnfSF    -0.126321173  0.138107898 -0.19960478 -0.556690778
## TotalBsmtSF  -0.276650525 -0.210326435  0.19680776 -0.259927710
## X1stFlrSF    -0.278448687 -0.149066580  0.27368130 -0.291096099
## X2ndFlrSF    -0.148658060  0.415466735 -0.06283344  0.333536969
## LowQualFinSF  0.014675123  0.129056351  0.12126231 -0.065103163
## GrLivArea    -0.326986077  0.247413326  0.16036640  0.056897790
## BsmtFullBath -0.077602016 -0.311686918  0.26363529  0.288699967
## BsmtHalfBath  0.013227881  0.002926211  0.06334389  0.033836542
## FullBath     -0.286504039  0.140195572 -0.11350494 -0.059246660
## HalfBath     -0.135802999  0.207823705 -0.13666140  0.414387939
## BedroomAbvGr -0.132845034  0.379561926  0.14494678 -0.004296945
## KitchenAbvGr  0.005442026  0.171565697  0.13734063 -0.135947137
## TotRmsAbvGrd -0.268648739  0.333118647  0.14998861  0.000976127
## Fireplaces   -0.195478268  0.004670823  0.21057776  0.094437226
## GarageYrBlt  -0.258308258 -0.186244169 -0.36318499  0.058798422
## GarageCars   -0.313815483 -0.090553435 -0.14110813  0.006250847
## GarageArea   -0.308738507 -0.122143433 -0.08063520 -0.010309663
##                       PC5          PC6         PC7          PC8
## LotFrontage  -0.103636085 -0.067201700  0.15390070  0.329392373
## LotArea       0.236359298 -0.072518277 -0.01232332 -0.204829830
## YearBuilt    -0.032688850 -0.067068008 -0.09232849 -0.002264289
## YearRemodAdd -0.030787946 -0.196932570 -0.02746442  0.113779586
## MasVnrArea    0.087224641  0.391865445  0.06102250  0.056790217
## BsmtFinSF1   -0.166865837  0.196098966  0.02437687  0.210171895
## BsmtFinSF2    0.253120072 -0.659634433 -0.34182048 -0.274986345
## BsmtUnfSF     0.141362403  0.053072713  0.15906132 -0.134162389
## TotalBsmtSF   0.061976295  0.014770039  0.05985784 -0.017742031
## X1stFlrSF     0.009027335  0.022615952 -0.02810272 -0.046147775
## X2ndFlrSF     0.011148682  0.001507863  0.06173446 -0.015776719
## LowQualFinSF -0.019839890 -0.455036236  0.36879121  0.572627737
## GrLivArea     0.014066933 -0.024214014  0.06473399  0.005929383
## BsmtFullBath -0.300669644 -0.131968773  0.07673101 -0.042339087
## BsmtHalfBath  0.482343579  0.183877377 -0.58071360  0.531546262
## FullBath     -0.147252695 -0.092999464 -0.16917358 -0.025685709
## HalfBath      0.141687951  0.036432511  0.13891434 -0.076187189
## BedroomAbvGr -0.079060589 -0.040432742 -0.15373392  0.042135331
## KitchenAbvGr -0.547981979  0.101616129 -0.44645276 -0.087939153
## TotRmsAbvGrd -0.108527031 -0.034258483 -0.05234999  0.009544590
## Fireplaces    0.346647662  0.124606598  0.20593676 -0.244546866
## GarageYrBlt  -0.061795252 -0.136855369 -0.08603007  0.054550678
## GarageCars    0.010532411 -0.002114496 -0.06463626 -0.001460012
## GarageArea   -0.003173496 -0.004776191 -0.03641875  0.042071145
##                       PC9         PC10        PC11        PC12
## LotFrontage  -0.740855966  0.218644481 -0.27560663  0.06319602
## LotArea      -0.054720216  0.606120579  0.38396756 -0.50937189
## YearBuilt     0.085162076  0.068424524 -0.15266104 -0.14086430
## YearRemodAdd  0.167117397  0.275137501 -0.24754450  0.01866748
## MasVnrArea   -0.055823971 -0.456713779 -0.09385641 -0.62667075
## BsmtFinSF1    0.090385507  0.065431024 -0.08456485  0.04855522
## BsmtFinSF2   -0.180384063 -0.346929982 -0.16858494 -0.08852110
## BsmtUnfSF     0.062823406 -0.025842925 -0.07073996 -0.03865640
## TotalBsmtSF   0.090914826 -0.085576035 -0.22115875 -0.02100544
## X1stFlrSF     0.119169822 -0.074325141 -0.09493369  0.08674646
## X2ndFlrSF    -0.053708283  0.047031298 -0.01262683  0.01194043
## LowQualFinSF  0.310992688 -0.195093018  0.28786619 -0.13763508
## GrLivArea     0.071831077 -0.033661985 -0.05369416  0.06100173
## BsmtFullBath  0.024655177  0.021441649 -0.08901473  0.01333057
## BsmtHalfBath  0.069981100  0.057232064 -0.01924082  0.10050585
## FullBath      0.229745149  0.176203649 -0.05362661 -0.01588288
## HalfBath     -0.140774563 -0.127792579 -0.11876591 -0.04919915
## BedroomAbvGr  0.002330621  0.078813298 -0.15052306 -0.05268625
## KitchenAbvGr  0.048000051 -0.115364910  0.19120823 -0.02267002
## TotRmsAbvGrd  0.036696599 -0.013580197 -0.05276266  0.06127635
## Fireplaces    0.176902793  0.005092835  0.04178891  0.44632165
## GarageYrBlt   0.032590204  0.033102955  0.05043677 -0.07158271
## GarageCars   -0.227454612 -0.134685895  0.44390069  0.18209946
## GarageArea   -0.268523106 -0.161858248  0.45846837  0.15392849
##                      PC13         PC14         PC15         PC16
## LotFrontage  -0.207972882  0.205523963 -0.140465378  0.012716860
## LotArea      -0.113313253 -0.071758102 -0.009754141 -0.005664877
## YearBuilt    -0.065106514 -0.026519516 -0.443896821  0.052086020
## YearRemodAdd -0.219717279  0.079388664  0.516162019 -0.426223864
## MasVnrArea   -0.018364906  0.340535683  0.062477255 -0.212310967
## BsmtFinSF1    0.078802494 -0.058356074 -0.008577452  0.258757311
## BsmtFinSF2   -0.027312609  0.108438835  0.036929580  0.052434830
## BsmtUnfSF    -0.066237395 -0.274797047  0.047255071 -0.104135822
## TotalBsmtSF   0.005168619 -0.297572305  0.052257738  0.183410779
## X1stFlrSF    -0.019099960 -0.078868766  0.051850172  0.111034017
## X2ndFlrSF     0.021034730  0.068277482  0.229187754  0.158183755
## LowQualFinSF -0.157843923  0.009210298 -0.137971779 -0.021789710
## GrLivArea    -0.011182931 -0.000450640  0.215770185  0.211076602
## BsmtFullBath  0.058252338 -0.151852444  0.062355931 -0.370097158
## BsmtHalfBath -0.086150713 -0.106163602  0.026090363 -0.055805339
## FullBath      0.108149166  0.429133463 -0.043555824  0.382796154
## HalfBath     -0.368746084 -0.502846195 -0.109356533  0.137758013
## BedroomAbvGr  0.501113606 -0.170596049 -0.359861980 -0.396217027
## KitchenAbvGr -0.553574027 -0.043707592 -0.078208061 -0.079749131
## TotRmsAbvGrd  0.077652127 -0.060247335  0.040215394 -0.126763253
## Fireplaces   -0.302838824  0.357911165 -0.336810648 -0.279803722
## GarageYrBlt  -0.004387633 -0.037261395 -0.305675983  0.025034339
## GarageCars    0.111972636 -0.016712775  0.073564141 -0.094852577
## GarageArea    0.165819335 -0.060787380  0.132828171 -0.029261784
##                      PC17         PC18         PC19        PC20
## LotFrontage  -0.051037863  0.005475805  0.013715636 -0.01812178
## LotArea       0.024064367 -0.016827673 -0.004725283  0.01955012
## YearBuilt    -0.035467928 -0.099113366  0.018824562  0.23502603
## YearRemodAdd  0.300977521  0.168981474 -0.088678446 -0.01864759
## MasVnrArea   -0.006284062  0.020509997  0.022116404 -0.01570259
## BsmtFinSF1    0.214156061  0.112912459 -0.284168415  0.13748492
## BsmtFinSF2    0.029911285  0.028095552 -0.111488395  0.04684240
## BsmtUnfSF    -0.347440305  0.030843961 -0.086336945  0.02390129
## TotalBsmtSF  -0.116298913  0.158786397 -0.423389758  0.18423387
## X1stFlrSF     0.274889795 -0.135130803  0.443859683 -0.33345728
## X2ndFlrSF    -0.275354611 -0.168735994 -0.371653865 -0.08179725
## LowQualFinSF -0.003365334  0.075743919 -0.014539185  0.01723154
## GrLivArea    -0.026822555 -0.232577794  0.016454069 -0.31167571
## BsmtFullBath -0.619777913 -0.038689215  0.242737308 -0.08272200
## BsmtHalfBath -0.242152704 -0.045321188  0.078201077 -0.01548062
## FullBath     -0.259363907  0.424019339  0.264157612  0.02415114
## HalfBath      0.101015017  0.313587530  0.287169968 -0.04497212
## BedroomAbvGr  0.177230936  0.284024621 -0.142328722 -0.22249894
## KitchenAbvGr  0.003057937  0.054618219 -0.182253289 -0.06122493
## TotRmsAbvGrd  0.133695037 -0.445394789  0.215122494  0.62912394
## Fireplaces   -0.020352044  0.037222203 -0.138441601 -0.01846767
## GarageYrBlt   0.024637103 -0.440476006 -0.160445473 -0.31391361
## GarageCars   -0.010009629  0.230848471  0.082287116  0.28395220
## GarageArea    0.034145290  0.047090485 -0.082787015 -0.18513167
##                      PC21          PC22          PC23          PC24
## LotFrontage   0.007395706 -0.0216147861  3.018241e-16 -1.005889e-16
## LotArea       0.005511807 -0.0046389545  2.906665e-16 -3.815000e-16
## YearBuilt    -0.570089389  0.3302550110  2.646796e-16  2.155836e-16
## YearRemodAdd  0.003824775  0.0287699387 -6.903257e-16  3.154166e-16
## MasVnrArea    0.047978832 -0.0235906161 -2.095595e-16  2.402605e-16
## BsmtFinSF1    0.066273099 -0.0424492938  1.737657e-02  5.781645e-01
## BsmtFinSF2    0.016701886 -0.0048305951  6.145992e-03  2.044935e-01
## BsmtUnfSF     0.035240262  0.0086463349  1.683439e-02  5.601247e-01
## TotalBsmtSF   0.110536332 -0.0371998682 -1.671393e-02 -5.561169e-01
## X1stFlrSF    -0.180714536 -0.0539374655  4.913397e-01 -1.476707e-02
## X2ndFlrSF    -0.200335459 -0.0227150421  5.548126e-01 -1.667473e-02
## LowQualFinSF -0.038898585  0.0115785131  6.179826e-02 -1.857329e-03
## GrLivArea    -0.302971381 -0.0574794846 -6.678674e-01  2.007256e-02
## BsmtFullBath  0.024727093 -0.0009081554 -2.269157e-17  1.780140e-17
## BsmtHalfBath  0.014043163  0.0068453432  4.346230e-17 -1.198273e-16
## FullBath      0.269022580  0.0348893208 -8.329774e-17  4.827869e-17
## HalfBath      0.189717516  0.0026948617 -7.188683e-17 -2.086940e-18
## BedroomAbvGr -0.052448710 -0.0317913225  1.055906e-16  3.319138e-16
## KitchenAbvGr -0.044747163  0.0070674175 -5.494251e-17  1.157658e-16
## TotRmsAbvGrd  0.272985109  0.1015726004  5.653153e-17 -2.483031e-16
## Fireplaces    0.068296834  0.0285671714  7.625148e-17 -1.981755e-17
## GarageYrBlt   0.432531080 -0.3449819791 -3.237769e-17 -2.096795e-16
## GarageCars   -0.276625560 -0.5696061949  7.910578e-17  1.135158e-16
## GarageArea    0.177905934  0.6496013645 -6.126922e-17 -1.283654e-17

Let us now examine the scree plot and the cumulative scree plot.

#compute standard deviation of each principal component
std_dev <- prin_comp$sdev

#compute variance
pr_var <- std_dev^2

#check variance of first 10 components
pr_var[1:10]
##  [1] 6.1042470 3.1050505 2.2238539 1.8753994 1.1963679 1.0489798 1.0379231
##  [8] 0.9777722 0.9202734 0.8185473
#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:20]
##  [1] 0.254343624 0.129377106 0.092660578 0.078141644 0.049848661
##  [6] 0.043707492 0.043246795 0.040740510 0.038344725 0.034106136
## [11] 0.033947264 0.028265587 0.025747604 0.024712960 0.020003264
## [16] 0.016665549 0.013363184 0.009332517 0.008403620 0.006091195
#scree plot
plot(prop_varex, xlab = "Principal Component",  ylab = "Proportion of Variance Explained",type = "b")

#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component", ylab = "Cumulative Proportion of Variance Explained", type = "b")

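From the scree plots, it appears that over 92% of the variability can be explained with 15 principal components: summing the proportions listed above, the first 15 components account for about 93.7% of the variance, while the first 14 account for about 91.7%.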
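Rather than reading the number of components off the plot, it can be computed from the cumulative proportions. The sketch below copies the 20 proportions printed above into a vector (named `prop_varex_20` here for illustration) and finds the smallest number of components that reaches a 92% threshold.

```r
# Proportions of variance for the first 20 components, copied from the output above
prop_varex_20 <- c(0.254343624, 0.129377106, 0.092660578, 0.078141644, 0.049848661,
                   0.043707492, 0.043246795, 0.040740510, 0.038344725, 0.034106136,
                   0.033947264, 0.028265587, 0.025747604, 0.024712960, 0.020003264,
                   0.016665549, 0.013363184, 0.009332517, 0.008403620, 0.006091195)

cum_varex <- cumsum(prop_varex_20)

# Smallest number of components whose cumulative share reaches 92%
n_comp <- which(cum_varex >= 0.92)[1]
n_comp
round(cum_varex[n_comp], 4)
```

In the full analysis the same two lines can be applied directly to prop_varex.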

Calculus-Based Probability & Statistics

We will now fit a closed form distribution to the data from our variable X. Since we will fit an exponential probability density function, we need to ensure that the variable X takes values in the interval \([0, \infty)\).

Let us therefore consider the minimum of X, which is 0. Since the minimum is greater than or equal to 0, the data lie within the interval and there is no need to shift them to the right.

# Fitting to exponential distribution
fd <- fitdistr(my_data$TotalBsmtSF, 'exponential')

fd_est <- fd$estimate

We will now take a random sample of 1000 observations from the fitted distribution and compare its histogram with that of our original variable.

r_X <- rexp(1000, fd_est)

hist(r_X)

In contrast, let us look at the histogram for our X variable (Total Basement Square Feet).

hist(my_data$TotalBsmtSF)

The variable Total Basement Square Feet has a sizeable number of observations at 0, since some houses do not have a basement. However, barring this spike at 0, the distribution of the variable more closely resembles a normal distribution skewed to the right by some high-value outliers.

5th and 95th Percentiles

  1. Using the CDF of the Exponential distribution

From the result of the fitdistr function, we know that the best fitting exponential probability distribution has a rate of \(9.4568957\times 10^{-4}\). This leads to the following PDF:
\(f(x)=\lambda e^{-\lambda x}\) where \(\lambda = 9.4568957\times 10^{-4}\).

Hence, its CDF is given by: \(F(x)=1-e^{-\lambda x}\), where \(\lambda = 0.0009456896\).

In order to find the pth percentile, we solve F(x)=p. Let us find the general formula for the exponential distribution.
\(F(x_p) = 1-e^{-\lambda x_p} = p \quad \Leftrightarrow \quad e^{-\lambda x_p} = 1-p\),
taking the logarithm of both sides, we get:
\(-\lambda x_p = \ln(1-p) \quad \Leftrightarrow \quad x_p=\frac{-1}{\lambda}\ln(1-p)\), with \(0\le p<1\)

Let us now find the 5th and 95th percentiles.

5th & 95th percentiles

p = 0.05 and p = 0.95; substituting p and \(\lambda\) in the formula we derived, we obtain:

f_perc <- function (p, l_rte){
  xp <- round(-1/l_rte*(log(1-p)),2)
  return(xp)
}

xp_5 <- f_perc(0.05, fd_est)
xp_95 <- f_perc(0.95, fd_est)

xp_5
##  rate 
## 54.24
xp_95
##    rate 
## 3167.78

For the exponential distribution, the 5th percentile is 54.24 and the 95th percentile is 3167.78. Hence, under the fitted exponential model, 90% of the probability mass falls between these 2 values.
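As a cross-check, the closed-form percentile formula should agree with R's built-in exponential quantile function qexp. The sketch below uses the rounded rate reported above (stored in a variable we call `rate_hat`) in place of the fitdistr estimate.

```r
rate_hat <- 0.0009456896  # rounded rate from the fitdistr output above

# Closed-form percentile of the exponential distribution: x_p = -ln(1 - p) / lambda
f_perc_raw <- function(p, l_rte) -1 / l_rte * log(1 - p)

p <- c(0.05, 0.95)
closed_form <- f_perc_raw(p, rate_hat)
built_in    <- qexp(p, rate = rate_hat)

all.equal(closed_form, built_in)
round(built_in, 2)
```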

Let us now consider the empirical data we have. First we will build a 95% confidence interval (for the mean) assuming a normal distribution, and then find the empirical 5th and 95th percentiles.

# 95% confidence interval assuming normal distribution

error <- qnorm(0.975)*sd_X/sqrt(t_obs)

lower_bound <- mean_X - error
upper_bound <- mean_X + error

Confidence interval: 1034.9268 to 1079.9332.

# Percentiles

xp_empirical <- quantile(my_data$TotalBsmtSF, c(.05, .95))

xp_empirical
##     5%    95% 
##  519.3 1753.0

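The empirical 5th and 95th percentiles (519.3 and 1753.0) are far from those of the fitted exponential distribution (54.24 and 3167.78). The results confirm that, although the distribution of X is skewed to the right, it is better described by a normal distribution (with the exception of the spike at 0) than by an exponential distribution.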

Regression Model

We want to be able to predict Sale Price based on a selection of some (or all) of the independent variables. From our previous analysis and from a survey of a few people in real estate, we will consider the following variables as possible predictors. We should note that some are quantitative and some are categorical.

From observation of the data, we will exclude the following variables, since they mostly contain NA values or take a single value: ‘Street’, ‘Alley’, ‘Utilities’, ‘CentralAir’, ‘FireplaceQu’, ‘PoolQC’, ‘Fence’, ‘MiscFeature’, ‘GarageQual’, ‘GarageCond’

From the results of the survey and from the PCA components, we will add variables to our model one at a time; the Adj. R2 value obtained after each addition is shown in parentheses:

GrLivArea (0.5021)
TotRmsAbvGrd (0.5097)
GarageCars (0.6293)
LotArea (0.6331)
YearBuilt (0.6922)
YearRemodAdd (0.7038)
Neighborhood (0.7763)
OverallCond (0.7855)
Foundation (0.7875)
BedroomAbvGr (0.7933)
FullBath (0.7932): when this variable was added, the Adj. R2 value went down, so we replace it with another
BsmtFinType1 (0.7972)
BsmtFinSF1 (0.8008)
BsmtFinType2 (0.8012)
BsmtFinSF2 (0.8017)
LowQualFinSF (0.8018)
MSSubClass (0.8168)
MSZoning (0.8173)
LandContour (0.8198)
Condition1 (0.8219)
Condition2 (0.8263)
BldgType (0.8296)
HouseStyle (0.8323)
RoofMatl (0.8687)
Exterior1st (0.8719)
Exterior2nd (0.8733)
ExterQual (0.8851)
ExterCond (0.8852)
BsmtQual (0.8915)
BsmtCond (0.8914): when this variable was added, the Adj. R2 value went down, so we replace it with another
Heating (0.8913): when this variable was added, the Adj. R2 value went down, so we replace it with another
CentralAir (0.8916)
X1stFlrSF (0.8915): when this variable was added, the Adj. R2 value went down, so we replace it with another
KitchenAbvGr (0.8925)
KitchenQual (0.8964)
PavedDrive (0.8962): when this variable was added, the Adj. R2 value went down, so we replace it with another
PoolArea (0.8979)

attach(my_data)

my.lm <- lm(SalePrice~GrLivArea)

my.lm_p <- lm(SalePrice~GrLivArea + TotRmsAbvGrd + GarageCars + LotArea + YearBuilt + YearRemodAdd + Neighborhood + OverallCond + Foundation + BedroomAbvGr + BsmtFinType1 + BsmtFinSF1 + BsmtFinType2 + BsmtFinSF2 + LowQualFinSF + MSSubClass + MSZoning + LandContour + Condition1 + Condition2 + BldgType + HouseStyle + RoofMatl + Exterior1st + Exterior2nd + ExterQual + ExterCond + BsmtQual + CentralAir + X1stFlrSF + KitchenAbvGr + KitchenQual + PoolArea)

As we add variables, we check the Adj. R2 value, and we repeat the process as long as the Adj. R2 value keeps increasing.
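The manual procedure described here can be sketched as a simple loop. This is not the code used to build the model above; it is an illustrative greedy pass on the built-in mtcars data, where `response` and `candidates` are hypothetical choices, and a candidate is kept only if it raises the adjusted R2.

```r
# Greedy forward pass: keep a candidate only if it raises adjusted R-squared.
# mtcars is a built-in stand-in; in the analysis above the response is SalePrice.
response   <- "mpg"
candidates <- c("wt", "hp", "qsec", "drat", "cyl")

kept     <- character(0)
best_adj <- -Inf

for (v in candidates) {
  trial  <- c(kept, v)
  fit    <- lm(reformulate(trial, response = response), data = mtcars)
  adj_r2 <- summary(fit)$adj.r.squared
  if (adj_r2 > best_adj) {
    kept     <- trial    # the variable improves the model, so keep it
    best_adj <- adj_r2
  }                      # otherwise skip it and move to the next candidate
}

kept
best_adj
```

The result depends on the order in which candidates are tried, which is also true of the manual process.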

summary(my.lm_p)
## 
## Call:
## lm(formula = SalePrice ~ GrLivArea + TotRmsAbvGrd + GarageCars + 
##     LotArea + YearBuilt + YearRemodAdd + Neighborhood + OverallCond + 
##     Foundation + BedroomAbvGr + BsmtFinType1 + BsmtFinSF1 + BsmtFinType2 + 
##     BsmtFinSF2 + LowQualFinSF + MSSubClass + MSZoning + LandContour + 
##     Condition1 + Condition2 + BldgType + HouseStyle + RoofMatl + 
##     Exterior1st + Exterior2nd + ExterQual + ExterCond + BsmtQual + 
##     CentralAir + X1stFlrSF + KitchenAbvGr + KitchenQual + PoolArea)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -178551  -10503     417   10767  178551 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.570e+06  1.790e+05  -8.773  < 2e-16 ***
## GrLivArea            7.328e+01  5.222e+00  14.034  < 2e-16 ***
## TotRmsAbvGrd         1.327e+03  1.013e+03   1.310 0.190396    
## GarageCars           7.972e+03  1.367e+03   5.832 6.91e-09 ***
## LotArea              5.665e-01  8.736e-02   6.485 1.26e-10 ***
## YearBuilt            4.693e+02  7.213e+01   6.505 1.11e-10 ***
## YearRemodAdd         4.975e+01  5.673e+01   0.877 0.380644    
## NeighborhoodBlueste  1.803e+03  2.060e+04   0.088 0.930267    
## NeighborhoodBrDale   2.117e+03  1.151e+04   0.184 0.854076    
## NeighborhoodBrkSide -1.072e+04  9.819e+03  -1.092 0.274952    
## NeighborhoodClearCr -2.077e+04  9.701e+03  -2.141 0.032480 *  
## NeighborhoodCollgCr -1.315e+04  7.548e+03  -1.742 0.081688 .  
## NeighborhoodCrawfor  5.344e+03  8.947e+03   0.597 0.550423    
## NeighborhoodEdwards -2.631e+04  8.309e+03  -3.166 0.001582 ** 
## NeighborhoodGilbert -1.804e+04  8.040e+03  -2.244 0.025020 *  
## NeighborhoodIDOTRR  -1.115e+04  1.116e+04  -0.999 0.318149    
## NeighborhoodMeadowV -1.647e+04  1.170e+04  -1.408 0.159336    
## NeighborhoodMitchel -2.910e+04  8.537e+03  -3.409 0.000672 ***
## NeighborhoodNAmes   -2.454e+04  8.117e+03  -3.023 0.002548 ** 
## NeighborhoodNoRidge  2.540e+04  8.674e+03   2.928 0.003470 ** 
## NeighborhoodNPkVill  1.008e+04  1.491e+04   0.676 0.499229    
## NeighborhoodNridgHt  2.235e+04  7.621e+03   2.933 0.003413 ** 
## NeighborhoodNWAmes  -2.598e+04  8.364e+03  -3.106 0.001936 ** 
## NeighborhoodOldTown -2.177e+04  1.001e+04  -2.174 0.029878 *  
## NeighborhoodSawyer  -2.250e+04  8.510e+03  -2.644 0.008285 ** 
## NeighborhoodSawyerW -1.392e+04  8.200e+03  -1.697 0.089915 .  
## NeighborhoodSomerst  5.735e+03  9.224e+03   0.622 0.534223    
## NeighborhoodStoneBr  3.796e+04  8.651e+03   4.388 1.24e-05 ***
## NeighborhoodSWISU   -1.401e+04  1.020e+04  -1.374 0.169555    
## NeighborhoodTimber  -1.555e+04  8.653e+03  -1.797 0.072575 .  
## NeighborhoodVeenker  3.386e+02  1.091e+04   0.031 0.975252    
## OverallCond          7.152e+03  8.917e+02   8.020 2.36e-15 ***
## FoundationCBlock     2.093e+03  3.328e+03   0.629 0.529523    
## FoundationPConc      5.243e+03  3.658e+03   1.433 0.151961    
## FoundationStone      3.662e+03  1.120e+04   0.327 0.743664    
## FoundationWood      -2.374e+04  1.555e+04  -1.527 0.127013    
## BedroomAbvGr        -4.678e+03  1.432e+03  -3.268 0.001112 ** 
## BsmtFinType1BLQ      3.206e+03  2.932e+03   1.093 0.274468    
## BsmtFinType1GLQ      8.215e+03  2.655e+03   3.094 0.002021 ** 
## BsmtFinType1LwQ     -6.355e+03  3.923e+03  -1.620 0.105533    
## BsmtFinType1Rec     -8.277e+02  3.146e+03  -0.263 0.792545    
## BsmtFinType1Unf      7.097e+03  3.078e+03   2.306 0.021296 *  
## BsmtFinSF1           2.503e+01  2.881e+00   8.688  < 2e-16 ***
## BsmtFinType2BLQ     -2.212e+04  8.023e+03  -2.757 0.005911 ** 
## BsmtFinType2GLQ     -9.404e+03  9.929e+03  -0.947 0.343710    
## BsmtFinType2LwQ     -2.337e+04  7.811e+03  -2.992 0.002825 ** 
## BsmtFinType2Rec     -1.940e+04  7.444e+03  -2.606 0.009265 ** 
## BsmtFinType2Unf     -1.679e+04  7.987e+03  -2.102 0.035782 *  
## BsmtFinSF2           1.238e+01  8.387e+00   1.476 0.140241    
## LowQualFinSF        -4.802e+01  1.867e+01  -2.572 0.010235 *  
## MSSubClass          -3.801e+01  8.845e+01  -0.430 0.667447    
## MSZoningFV           3.771e+04  1.245e+04   3.029 0.002506 ** 
## MSZoningRH           3.644e+04  1.262e+04   2.888 0.003948 ** 
## MSZoningRL           3.660e+04  1.060e+04   3.453 0.000571 ***
## MSZoningRM           3.272e+04  9.821e+03   3.331 0.000889 ***
## LandContourHLS       1.970e+04  5.464e+03   3.606 0.000323 ***
## LandContourLow       9.540e+02  6.710e+03   0.142 0.886963    
## LandContourLvl       7.663e+03  3.881e+03   1.975 0.048523 *  
## Condition1Feedr     -3.247e+02  5.357e+03  -0.061 0.951685    
## Condition1Norm       7.750e+03  4.398e+03   1.762 0.078263 .  
## Condition1PosA      -2.067e+03  1.064e+04  -0.194 0.846005    
## Condition1PosN       3.305e+03  7.875e+03   0.420 0.674798    
## Condition1RRAe      -1.944e+04  9.540e+03  -2.038 0.041781 *  
## Condition1RRAn       3.592e+03  7.261e+03   0.495 0.620902    
## Condition1RRNe      -5.971e+03  1.907e+04  -0.313 0.754258    
## Condition1RRNn      -3.056e+03  1.336e+04  -0.229 0.819127    
## Condition2Feedr      5.860e+03  2.372e+04   0.247 0.804938    
## Condition2Norm       3.286e+03  2.040e+04   0.161 0.872047    
## Condition2PosA       3.301e+04  3.894e+04   0.848 0.396764    
## Condition2PosN      -2.084e+05  2.852e+04  -7.308 4.75e-13 ***
## Condition2RRAe      -2.506e+04  3.402e+04  -0.737 0.461506    
## Condition2RRAn      -6.333e+03  3.326e+04  -0.190 0.849016    
## Condition2RRNn       1.278e+04  2.833e+04   0.451 0.651956    
## BldgType2fmCon       3.039e+03  1.336e+04   0.228 0.820008    
## BldgTypeDuplex      -4.252e+03  7.599e+03  -0.559 0.575924    
## BldgTypeTwnhs       -2.913e+04  1.053e+04  -2.766 0.005750 ** 
## BldgTypeTwnhsE      -2.117e+04  9.560e+03  -2.215 0.026939 *  
## HouseStyle1.5Unf     1.669e+04  7.999e+03   2.086 0.037147 *  
## HouseStyle1Story     7.289e+03  4.521e+03   1.612 0.107186    
## HouseStyle2.5Fin    -1.027e+04  1.264e+04  -0.813 0.416596    
## HouseStyle2.5Unf     7.904e+03  9.427e+03   0.839 0.401900    
## HouseStyle2Story    -3.458e+03  3.666e+03  -0.943 0.345622    
## HouseStyleSFoyer     9.796e+03  6.708e+03   1.460 0.144420    
## HouseStyleSLvl       4.052e+03  5.605e+03   0.723 0.469802    
## RoofMatlCompShg      6.273e+05  3.233e+04  19.401  < 2e-16 ***
## RoofMatlMembran      6.885e+05  4.293e+04  16.037  < 2e-16 ***
## RoofMatlMetal        6.721e+05  4.213e+04  15.952  < 2e-16 ***
## RoofMatlRoll         6.400e+05  4.239e+04  15.097  < 2e-16 ***
## RoofMatlTar&Grv      6.155e+05  3.309e+04  18.598  < 2e-16 ***
## RoofMatlWdShake      6.332e+05  3.483e+04  18.180  < 2e-16 ***
## RoofMatlWdShngl      7.132e+05  3.373e+04  21.142  < 2e-16 ***
## Exterior1stBrkComm  -5.874e+04  3.472e+04  -1.692 0.090933 .  
## Exterior1stBrkFace   1.479e+04  1.394e+04   1.061 0.289008    
## Exterior1stCBlock   -1.316e+04  2.924e+04  -0.450 0.652598    
## Exterior1stCemntBd   7.059e+02  2.038e+04   0.035 0.972372    
## Exterior1stHdBoard  -3.280e+03  1.398e+04  -0.235 0.814609    
## Exterior1stImStucc  -4.748e+04  3.013e+04  -1.576 0.115258    
## Exterior1stMetalSd   2.640e+03  1.571e+04   0.168 0.866575    
## Exterior1stPlywood  -7.526e+03  1.389e+04  -0.542 0.587946    
## Exterior1stStone    -2.638e+04  2.588e+04  -1.020 0.308088    
## Exterior1stStucco    1.382e+03  1.531e+04   0.090 0.928093    
## Exterior1stVinylSd  -1.582e+04  1.439e+04  -1.100 0.271724    
## Exterior1stWd Sdng  -7.139e+03  1.349e+04  -0.529 0.596709    
## Exterior1stWdShing  -3.073e+03  1.457e+04  -0.211 0.833024    
## Exterior2ndAsphShn   8.376e+03  2.291e+04   0.366 0.714681    
## Exterior2ndBrk Cmn   8.643e+03  2.208e+04   0.391 0.695513    
## Exterior2ndBrkFace   6.267e+03  1.459e+04   0.430 0.667536    
## Exterior2ndCBlock           NA         NA      NA       NA    
## Exterior2ndCmentBd   6.577e+03  2.023e+04   0.325 0.745149    
## Exterior2ndHdBoard   5.028e+03  1.364e+04   0.369 0.712535    
## Exterior2ndImStucc   2.595e+04  1.553e+04   1.671 0.094995 .  
## Exterior2ndMetalSd   2.014e+03  1.548e+04   0.130 0.896463    
## Exterior2ndOther     2.080e+04  2.973e+04   0.700 0.484287    
## Exterior2ndPlywood   6.638e+03  1.335e+04   0.497 0.619169    
## Exterior2ndStone    -3.131e+04  2.152e+04  -1.455 0.145870    
## Exterior2ndStucco    1.056e+04  1.463e+04   0.722 0.470465    
## Exterior2ndVinylSd   2.133e+04  1.396e+04   1.528 0.126690    
## Exterior2ndWd Sdng   1.264e+04  1.317e+04   0.960 0.337459    
## Exterior2ndWd Shng   4.504e+03  1.366e+04   0.330 0.741644    
## ExterQualFa         -2.506e+04  1.116e+04  -2.246 0.024860 *  
## ExterQualGd         -3.018e+04  5.070e+03  -5.953 3.39e-09 ***
## ExterQualTA         -3.407e+04  5.587e+03  -6.098 1.42e-09 ***
## ExterCondFa         -1.483e+04  1.949e+04  -0.761 0.446769    
## ExterCondGd         -2.118e+04  1.839e+04  -1.152 0.249613    
## ExterCondPo         -4.037e+04  3.291e+04  -1.227 0.220169    
## ExterCondTA         -1.877e+04  1.837e+04  -1.022 0.307083    
## BsmtQualFa          -2.726e+04  6.461e+03  -4.219 2.62e-05 ***
## BsmtQualGd          -2.723e+04  3.507e+03  -7.765 1.66e-14 ***
## BsmtQualTA          -2.543e+04  4.338e+03  -5.861 5.83e-09 ***
## CentralAirY          4.252e+03  3.840e+03   1.107 0.268392    
## X1stFlrSF            4.673e+00  5.846e+00   0.799 0.424299    
## KitchenAbvGr        -2.115e+04  5.813e+03  -3.638 0.000286 ***
## KitchenQualFa       -2.764e+04  6.548e+03  -4.221 2.60e-05 ***
## KitchenQualGd       -2.632e+04  3.663e+03  -7.186 1.13e-12 ***
## KitchenQualTA       -2.677e+04  4.157e+03  -6.440 1.68e-10 ***
## PoolArea             8.589e+01  1.897e+01   4.528 6.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25360 on 1287 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.9076, Adjusted R-squared:  0.8979 
## F-statistic: 94.29 on 134 and 1287 DF,  p-value: < 2.2e-16

We will now apply our model to the test data set.

tst_data <- read.csv(file="test.csv",head=TRUE,sep=",")

result_data <- predict(my.lm_p, tst_data, interval="predict")

I have not been able to run this prediction: Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor Foundation has new levels Slab

I have not been able to resolve this for all the categorical variables. I have checked both data sets and they have the same levels.
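One plausible explanation, consistent with the "38 observations deleted due to missingness" line and the absence of a FoundationSlab coefficient in the summary above, is that lm dropped every training row containing an NA, and with them all rows of a given level. The level then exists in the raw training data but not in the fitted model. The sketch below reproduces this situation on a small hypothetical data frame (the names `train`, `test` and `price` are made up) and shows one way to diagnose it; the final subsetting step is only one possible workaround, not a recommended fix.

```r
# Hypothetical toy data: every 'Slab' row in the training set has a missing predictor
train <- data.frame(
  price      = c(100, 120, 150, 90, 130, 160, 80, 85),
  Foundation = factor(c("PConc", "CBlock", "PConc", "CBlock",
                        "PConc", "PConc", "Slab", "Slab")),
  BsmtSF     = c(800, 900, 1200, 700, 1000, 1300, NA, NA)
)
test <- data.frame(
  Foundation = c("PConc", "Slab"),
  BsmtSF     = c(1000, 0)
)

fit <- lm(price ~ Foundation + BsmtSF, data = train)

# Levels the fitted model actually saw, after rows with NA were dropped
fit$xlevels$Foundation

# Levels present in the new data but unknown to the model
new_levels <- setdiff(unique(as.character(test$Foundation)), fit$xlevels$Foundation)
new_levels

# One possible workaround: predict only for rows whose levels the model knows
ok   <- !(as.character(test$Foundation) %in% new_levels)
pred <- predict(fit, newdata = test[ok, , drop = FALSE])
```

Comparing `fit$xlevels` with the levels in the new data, rather than comparing the two raw files, shows which levels were lost when incomplete rows were removed.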

Kaggle user name: vbriot